2004-11-03 12:51:44

by Gene Heskett

Subject: is killing zombies possible w/o a reboot?

Greetings;

I thought I'd get caught up on -bkx kernels and made a -bk8 just now.

But I'd tried to run gnomeradio earlier to listen to the elections,
but it failed to run, as did tvtime, claiming it
couldn't get a lock on /dev/video0, and gnomeradio apparently left a
lock on alsasound that prevented the normal graceful shutdown by
locking up the shutdown on the "stopping alsasound" line. So I had
to use the hardware reset.

I'd tried to kill the zombie earlier but couldn't.

Isn't there some way to clean up a &^$#^#@)_ zombie?

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
99.28% setiathome rank, not too shabby for a WV hillbilly
Yahoo.com attorneys please note, additions to this message
by Gene Heskett are:
Copyright 2004 by Maurice Eugene Heskett, all rights reserved.


2004-11-03 14:33:52

by bert hubert

Subject: Re: is killing zombies possible w/o a reboot?

On Wed, Nov 03, 2004 at 07:51:39AM -0500, Gene Heskett wrote:

> But I'd tried to run gnomeradio earlier to listen to the elections,

Depressing enough.

> I'd tried to kill the zombie earlier but couldn't.
> Isn't there some way to clean up a &^$#^#@)_ zombie?

Kill the parent, is the only (portable) way.

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO

2004-11-03 14:49:51

by Måns Rullgård

Subject: Re: is killing zombies possible w/o a reboot?

bert hubert <[email protected]> writes:

> On Wed, Nov 03, 2004 at 07:51:39AM -0500, Gene Heskett wrote:
>
>> But I'd tried to run gnomeradio earlier to listen to the elections,
>
> Depressing enough.
>
>> I'd tried to kill the zombie earlier but couldn't.
>> Isn't there some way to clean up a &^$#^#@)_ zombie?
>
> Kill the parent, is the only (portable) way.

Perhaps not as portable, but another possible, though slightly
complicated, way is to ptrace the parent and force it to wait().

--
Måns Rullgård
[email protected]

2004-11-03 15:30:32

by Måns Rullgård

Subject: Re: is killing zombies possible w/o a reboot?

DervishD <[email protected]> writes:

> Hi all :)
>
> * Måns Rullgård <[email protected]> dixit:
>> >> I'd tried to kill the zombie earlier but couldn't.
>> >> Isn't there some way to clean up a &^$#^#@)_ zombie?
>> > Kill the parent, is the only (portable) way.
>> Perhaps not as portable, but another possible, though slightly
>> complicated, way is to ptrace the parent and force it to wait().
>
> Or write a little program that just 'wait()'s for the specified
> PID's. That is perfectly portable IMHO. But I must admit that the
> preferred way should be killing the parent. 'init' will reap the
> children after that.

You can only wait() for your own children.
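
A minimal Python sketch of that restriction, assuming Linux and
Python 3. The helper name try_reap is made up for illustration. Asking
to wait for a process that is not your own child is refused with
ECHILD, so nothing gets reaped:

```python
import errno
import os


def try_reap(pid):
    """Try to wait() for an arbitrary PID without blocking.

    Returns None if the kernel accepted the call (pid is our child),
    or the symbolic errno name if it refused.
    """
    try:
        os.waitpid(pid, os.WNOHANG)
    except OSError as exc:
        return errno.errorcode[exc.errno]
    return None


# PID 1 (init) is never a child of the calling process.
print(try_reap(1))
```

This is why a standalone "reaper" utility cannot clean up somebody
else's zombie: only the parent (or whoever inherits the child) can.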

--
Måns Rullgård
[email protected]

2004-11-03 15:34:16

by DervishD

Subject: Re: is killing zombies possible w/o a reboot?

Hi all :)

* Måns Rullgård <[email protected]> dixit:
> >> I'd tried to kill the zombie earlier but couldn't.
> >> Isn't there some way to clean up a &^$#^#@)_ zombie?
> > Kill the parent, is the only (portable) way.
> Perhaps not as portable, but another possible, though slightly
> complicated, way is to ptrace the parent and force it to wait().

Or write a little program that just 'wait()'s for the specified
PID's. That is perfectly portable IMHO. But I must admit that the
preferred way should be killing the parent. 'init' will reap the
children after that.

Raúl Núñez de Arenas Coronado

--
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/

2004-11-03 16:24:25

by Gene Heskett

Subject: Re: is killing zombies possible w/o a reboot?

On Wednesday 03 November 2004 09:33, bert hubert wrote:
>On Wed, Nov 03, 2004 at 07:51:39AM -0500, Gene Heskett wrote:
>> But I'd tried to run gnomeradio earlier to listen to the
>> elections,
>
>Depressing enough.
>
>> I'd tried to kill the zombie earlier but couldn't.
>> Isn't there some way to clean up a &^$#^#@)_ zombie?
>
>Kill the parent, is the only (portable) way.

The parent would have been the icon. It opened its usual sized small
window, but never did anything to it. I clicked on closing the
window, but 10 seconds later the system asked me if I wanted to kill
it as it wasn't responding. I said yes, the window disappeared, but
kpm said gnomeradio was still present as process 8162, and that wasn't
killable. Funny thing is, on the reboot, it automatically self
restored and ran just fine.

I consider this as one of Linux's Achilles' heels. Such a hung and
dead process can be properly disposed of by a primitive OS called OS-9
because it keeps track of all resources in tables in the kernel
memory space. Issuing a kill procnumber removes the process from
the exec queue, reclaims all its memory to the system free memory
pool, and removes it from the IRQ service tables if an entry exists
there. Near instant, total cleanup, nothing left, in about 250
microseconds max. 1.79 MHz CPUs aren't quite instant :)

Let's just say that I think having to reboot because of a zombie that
has resources locked up, and have the reboot fubared by it too,
aren't exactly friendly actions.

I fully realise that linux has a much more complex method of
allocating resources, but doesn't it *know* exactly what resources
have been passed out to each process?

And why is there no entry from the kill function into that resource
management portion of the kernel so that this could also be done by
the linux kernel, say with a "kill --total procnumber"?

Seems like a heck of a good question to me, since an OS written to run
on a 64K machine in 1981, and expanded to run on a 128K to 2 megabyte
machine in 1986, can do it just fine. Even if that process is still
running and spitting out data to its parent window/shell! Or if it's
crashed and scribbled over all its memory, makes no difference to
OS-9. You (root) want it gone, fine, it's gone.

--
Cheers, Gene

2004-11-03 16:39:50

by Gene Heskett

Subject: Re: is killing zombies possible w/o a reboot?

On Wednesday 03 November 2004 09:49, Måns Rullgård wrote:
>bert hubert <[email protected]> writes:
>> On Wed, Nov 03, 2004 at 07:51:39AM -0500, Gene Heskett wrote:
>>> But I'd tried to run gnomeradio earlier to listen to the
>>> elections,
>>
>> Depressing enough.
>>
>>> I'd tried to kill the zombie earlier but couldn't.
>>> Isn't there some way to clean up a &^$#^#@)_ zombie?
>>
>> Kill the parent, is the only (portable) way.
>
>Perhaps not as portable, but another possible, though slightly
>complicated, way is to ptrace the parent and force it to wait().

No deal. No way. The user needs something to clean up when he clicks
on an icon, and things go to hell in a handbasket. He has no advance
warning available to him to tell him he had better ptrace this one
that I'm aware of.

--
Cheers, Gene

2004-11-03 16:48:53

by Gene Heskett

Subject: Re: is killing zombies possible w/o a reboot?

On Wednesday 03 November 2004 10:25, DervishD wrote:
> Hi all :)
>
> * Måns Rullgård <[email protected]> dixit:
>> >> I'd tried to kill the zombie earlier but couldn't.
>> >> Isn't there some way to clean up a &^$#^#@)_ zombie?
>> >
>> > Kill the parent, is the only (portable) way.
>>
>> Perhaps not as portable, but another possible, though slightly
>> complicated, way is to ptrace the parent and force it to wait().
>
> Or write a little program that just 'wait()'s for the specified
>PID's. That is perfectly portable IMHO. But I must admit that the
>preferred way should be killing the parent. 'init' will reap the
>children after that.

But what if there is no parent, since the system has already disposed
of it?

There was no parent visible to kpm. Unforch kpm also doesn't
specifically mark zombies as such either, so it's a bit clueless in
that regard. Finding them is usually an exercise in stretching the
top window out till it's about 20 screens high, as it's always going to
be at the bottom of the list.

If init can indeed do the cleanup, then how hard is it to have a "kill
--total procnumber" pass that info into init and let it do its thing?

Or better yet, when X asks me if I want it gone because it's not
responding to the close button, have X do it all in one swell foop.

> Raúl Núñez de Arenas Coronado

--
Cheers, Gene

2004-11-03 16:48:28

by linux-os

Subject: Re: is killing zombies possible w/o a reboot?

On Wed, 3 Nov 2004, Gene Heskett wrote:

> On Wednesday 03 November 2004 09:33, bert hubert wrote:
>> On Wed, Nov 03, 2004 at 07:51:39AM -0500, Gene Heskett wrote:
>>> But I'd tried to run gnomeradio earlier to listen to the
>>> elections,
>>
>> Depressing enough.
>>
>>> I'd tried to kill the zombie earlier but couldn't.
>>> Isn't there some way to clean up a &^$#^#@)_ zombie?
>>
>> Kill the parent, is the only (portable) way.
>
> The parent would have been the icon. It opened its usual sized small
> window, but never did anything to it. I clicked on closing the
> window, but 10 seconds later the system asked me if I wanted to kill
> it as it wasn't responding. I said yes, the window disappeared, but
> kpm said gnomeradio was still present as process 8162, and that wasn't
> killable. Funny thing is, on the reboot, it automatically self
> restored and ran just fine.
>
> I consider this as one of Linux's Achilles' heels. Such a hung and
> dead process can be properly disposed of by a primitive OS called OS-9
> because it keeps track of all resources in tables in the kernel
> memory space. Issuing a kill procnumber removes the process from
> the exec queue, reclaims all its memory to the system free memory
> pool, and removes it from the IRQ service tables if an entry exists
> there. Near instant, total cleanup, nothing left, in about 250
> microseconds max. 1.79 MHz CPUs aren't quite instant :)
>
> Let's just say that I think having to reboot because of a zombie that
> has resources locked up, and have the reboot fubared by it too,
> aren't exactly friendly actions.

[SNIPPED....]

There is no problem killing a task and freeing its resources.
The problem is that Linux and other Unix variations need to
do this in a specific manner. That manner being that some
parent (or ultimately init) needs to receive the terminating
status. A task that has been otherwise killed, but is awaiting
its status to be obtained is in the 'Z' or zombie state. If
the code for either the child task or its parent was improperly
written, the death of a parent could allow a child to wait
forever (zombie).

The fix is to fix the code. Your temporary fix is to use
Ctrl-Alt-backspace to kill the X11 server (the parent).
If it doesn't restart (it's not a kernel problem, it's
a distribution problem), you can log in as root and
execute:

/etc/X11/prefdm &

All these little windows and icons are the 'children' of
the X server. The above is a temporary work-around for
a non-kernel problem.


Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by John Ashcroft.
98.36% of all statistics are fiction.

2004-11-03 17:44:54

by DervishD

Subject: Re: is killing zombies possible w/o a reboot?

Hi Gene :)

* Gene Heskett <[email protected]> dixit:
> > Or write a little program that just 'wait()'s for the specified
> >PID's. That is perfectly portable IMHO. But I must admit that the
> >preferred way should be killing the parent. 'init' will reap the
> >children after that.
> But what if there is no parent, since the system has already disposed
> of it?

Then the children are reparented to 'init' and 'init' gets rid of
them. That's the way UNIX behaves.
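
The reparenting can be observed with a small experiment. This is only
a sketch, assuming Linux and Python 3; the function name
orphan_new_parent is hypothetical. A child forks a grandchild and exits
at once, and the grandchild reports who its parent is afterwards:

```python
import os
import time


def orphan_new_parent(timeout=5.0):
    """Orphan a grandchild and report who adopts it.

    Returns (pid_of_exited_child, grandchild's parent PID after adoption).
    """
    read_fd, write_fd = os.pipe()
    child = os.fork()
    if child == 0:
        try:
            os.close(read_fd)
            my_pid = os.getpid()
            if os.fork() == 0:
                # Grandchild: poll until getppid() stops naming the child.
                deadline = time.monotonic() + timeout
                while os.getppid() == my_pid and time.monotonic() < deadline:
                    time.sleep(0.01)
                os.write(write_fd, str(os.getppid()).encode())
        finally:
            # Both the child and the grandchild leave here, never returning
            # into the caller's code.
            os._exit(0)
    # Original process: reap the child ourselves so it is not left a zombie.
    os.close(write_fd)
    os.waitpid(child, 0)
    new_parent = int(os.read(read_fd, 32).decode())
    os.close(read_fd)
    return child, new_parent


if __name__ == "__main__":
    print(orphan_new_parent())
```

On a traditional setup the second number is 1 (init). In containers or
under some session managers another designated reaper process may show
up instead, so treat the exact value as system-dependent.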

Raúl Núñez de Arenas Coronado

--
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/

2004-11-03 17:47:45

by DervishD

Subject: Re: is killing zombies possible w/o a reboot?

Hi Måns :)

* Måns Rullgård <[email protected]> dixit:
> >> >> I'd tried to kill the zombie earlier but couldn't.
> >> >> Isn't there some way to clean up a &^$#^#@)_ zombie?
> >> > Kill the parent, is the only (portable) way.
> >> Perhaps not as portable, but another possible, though slightly
> >> complicated, way is to ptrace the parent and force it to wait().
> > Or write a little program that just 'wait()'s for the specified
> > PID's. That is perfectly portable IMHO. But I must admit that the
> > preferred way should be killing the parent. 'init' will reap the
> > children after that.
> You can only wait() for your own children.

Yes, you will receive 'ECHILD', I didn't remember that, sorry.
Anyway, you shouldn't need to do that, since those zombies should
have been reparented to 'init'.

But, since SUSv3 doesn't specify which PID should be the parent
when doing the reparenting, PID 0 could be used when reparenting as a
way of telling the kernel "hey, reap those processes". Anyway, since
the kernel does the reparenting, the kernel could get rid of zombies.
I don't really know why 'init' (PID 1) is responsible for this.

Raúl Núñez de Arenas Coronado

--
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/

2004-11-03 18:53:45

by Gene Heskett

Subject: Re: is killing zombies possible w/o a reboot?

On Wednesday 03 November 2004 12:44, DervishD wrote:
> Hi Gene :)
>
> * Gene Heskett <[email protected]> dixit:
>> > Or write a little program that just 'wait()'s for the
>> > specified PID's. That is perfectly portable IMHO. But I must
>> > admit that the preferred way should be killing the parent.
>> > 'init' will reap the children after that.
>>
>> But what if there is no parent, since the system has already
>> disposed of it?
>
> Then the children are reparented to 'init' and 'init' gets rid
> of them. That's the way UNIX behaves.

Unforch, I've *never* had it work that way. Any dead process I've
ever had while running linux has only been disposable by a reboot.

> Raúl Núñez de Arenas Coronado

--
Cheers, Gene

2004-11-03 19:01:33

by Douglas McNaught

Subject: Re: is killing zombies possible w/o a reboot?

Gene Heskett <[email protected]> writes:

> On Wednesday 03 November 2004 12:44, DervishD wrote:

>> Then the children are reparented to 'init' and 'init' gets rid
>> of them. That's the way UNIX behaves.
>
> Unforch, I've *never* had it work that way. Any dead process I've
> ever had while running linux has only been disposable by a reboot.

Then it's either (a) not actually a zombie (perhaps stuck in D state),
or (b) its parent is still alive.

A zombie process is just an entry in the process table where the exit
status etc are stored until the parent reaps it--all other resources
(memory, FDs etc) have been released. So if your "zombie" process is
actually taking up resources (which I think you said in an earlier
post), there's something else at work.
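
To make the description concrete, here is a rough Python sketch,
assuming Linux with /proc mounted. The helper names proc_state and
make_and_reap_zombie are invented for this example. It creates a real
zombie, reads its state letter, then reaps it:

```python
import os
import time


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat", errors="replace") as f:
        stat = f.read()
    # The command name sits in parentheses and may contain spaces,
    # so split after the last ')'.
    return stat.rsplit(")", 1)[1].split()[0]


def make_and_reap_zombie(timeout=5.0):
    """Fork a child that exits at once; return (state, exit_code, gone)."""
    pid = os.fork()
    if pid == 0:
        os._exit(7)  # child: terminate immediately with a known status
    # Parent: until we call waitpid(), the dead child stays a zombie.
    deadline = time.monotonic() + timeout
    state = proc_state(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = proc_state(pid)
    _, status = os.waitpid(pid, 0)  # reaping frees the process table entry
    gone = not os.path.exists(f"/proc/{pid}")
    return state, os.WEXITSTATUS(status), gone


if __name__ == "__main__":
    print(make_and_reap_zombie())
```

The only thing the zombie was holding was its exit status (7 here),
which is handed to the parent by waitpid().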

-Doug

2004-11-03 19:05:00

by Måns Rullgård

Subject: Re: is killing zombies possible w/o a reboot?

Gene Heskett <[email protected]> writes:

> On Wednesday 03 November 2004 12:44, DervishD wrote:
>> Hi Gene :)
>>
>> * Gene Heskett <[email protected]> dixit:
>>> > Or write a little program that just 'wait()'s for the
>>> > specified PID's. That is perfectly portable IMHO. But I must
>>> > admit that the preferred way should be killing the parent.
>>> > 'init' will reap the children after that.
>>>
>>> But what if there is no parent, since the system has already
>>> disposed of it?
>>
>> Then the children are reparented to 'init' and 'init' gets rid
>> of them. That's the way UNIX behaves.
>
> Unforch, I've *never* had it work that way. Any dead process I've
> ever had while running linux has only been disposable by a reboot.

That's because its parent was still sitting around refusing to wait()
for them.

--
Måns Rullgård
[email protected]

2004-11-03 19:08:43

by Valdis Klētnieks

Subject: Re: is killing zombies possible w/o a reboot?

On Wed, 03 Nov 2004 13:53:39 EST, Gene Heskett said:
> On Wednesday 03 November 2004 12:44, DervishD wrote:

> > Then the children are reparented to 'init' and 'init' gets rid
> > of them. That's the way UNIX behaves.
>
> Unforch, I've *never* had it work that way. Any dead process I've
> ever had while running linux has only been disposable by a reboot.

The problem likely isn't the true "zombie" - the only thing that *those*
processes have left is a process table entry to save the exit code for a wait()
syscall that might not happen anytime soon. And unless you have hundreds
of them sitting around causing pressure on the 32K process limit, they're
probably not a big problem.

More likely, what you're looking at is some process that has gone down into the
kernel on some syscall or other and gotten blocked. Since signals aren't
delivered until it returns, it ends up "unkillable".
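
One way to tell the two cases apart is to look at the state letter the
kernel reports for each process. The following is a sketch only,
assuming Linux with /proc mounted; processes_by_state is a made-up
helper name:

```python
import os


def processes_by_state():
    """Group PIDs by the one-letter state in /proc/<pid>/stat (Linux only)."""
    groups = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat", errors="replace") as f:
                stat = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        # The state letter is the first field after the "(command)" part.
        state = stat.rsplit(")", 1)[1].split()[0]
        groups.setdefault(state, []).append(int(entry))
    return groups


groups = processes_by_state()
print("zombies (Z):", groups.get("Z", []))
print("uninterruptible sleep (D):", groups.get("D", []))
```

A process that stays in the D list and ignores kill -9 is blocked
inside the kernel, which is a different problem from a true zombie.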

Traditionally, a common cause for such wedging was a lost/misplaced interrupt
from an I/O operation, so a read()/write()/ioctl() call wouldn't return because
the device hadn't reported it completed. (tape drives were notorious for this).
Often, power-cycling the I/O device would cause an unsolicited interrupt to be
generated, which would clear the "waiting for interrupt" issue and allow the
process to return....



2004-11-03 19:14:07

by Gene Heskett

Subject: Re: is killing zombies possible w/o a reboot?

On Wednesday 03 November 2004 11:46, linux-os wrote:
>On Wed, 3 Nov 2004, Gene Heskett wrote:
>> On Wednesday 03 November 2004 09:33, bert hubert wrote:
>>> On Wed, Nov 03, 2004 at 07:51:39AM -0500, Gene Heskett wrote:
>>>> But I'd tried to run gnomeradio earlier to listen to the
>>>> elections,
>>>
>>> Depressing enough.
>>>
>>>> I'd tried to kill the zombie earlier but couldn't.
>>>> Isn't there some way to clean up a &^$#^#@)_ zombie?
>>>
>>> Kill the parent, is the only (portable) way.
>>
>> The parent would have been the icon. It opened its usual sized
>> small window, but never did anything to it. I clicked on closing
>> the window, but 10 seconds later the system asked me if I wanted
>> to kill it as it wasn't responding. I said yes, the window
>> disappeared, but kpm said gnomeradio was still present as process
>> 8162, and that wasn't killable. Funny thing is, on the reboot, it
>> automatically self restored and ran just fine.
>>
>> I consider this as one of Linux's Achilles' heels. Such a hung and
>> dead process can be properly disposed of by a primitive OS called
>> OS-9 because it keeps track of all resources in tables in the
>> kernel memory space. Issuing a kill procnumber removes the
>> process from the exec queue, reclaims all its memory to the system
>> free memory pool, and removes it from the IRQ service tables if an
>> entry exists there. Near instant, total cleanup, nothing left, in
>> about 250 microseconds max. 1.79 MHz CPUs aren't quite instant :)
>>
>> Let's just say that I think having to reboot because of a zombie
>> that has resources locked up, and have the reboot fubared by it
>> too, aren't exactly friendly actions.
>
>[SNIPPED....]
>
>There is no problem killing a task and freeing its resources.
>The problem is that Linux and other Unix variations need to
>do this in a specific manner. That manner being that some
>parent (or ultimately init) needs to receive the terminating
>status. A task that has been otherwise killed, but is awaiting
>its status to be obtained is in the 'Z' or zombie state. If
>the code for either the child task or its parent was improperly
>written, the death of a parent could allow a child to wait
>forever (zombie).
>
>The fix is to fix the code.

In other words, it's gnomeradio that needs fixing then?

It's the best 'radio' proggy I've run across that works with my
hardware, but I'm not sure it has a support person at this late date.
It's probably not been touched in 2 years. KDE doesn't appear to have
a similar util that I've run across in the menus so far, and it's
3.3.0 here.

All of which seems to be dancing around the real problem though.
There seems to be no handy (to the user) path into the kernel to
allow such an unconditional kill function. root should have that
ability.

>Your temporary fix is to use
>Ctrl-Alt-backspace to kill the X11 server (the parent).

The logout took about 2 minutes because X couldn't clear itself
either.

>If it doesn't restart (it's not a kernel problem, it's
>a distribution problem), you can log in as root and
>execute:
>
> /etc/X11/prefdm &

I'll try that next time.

>All these little windows and icons are the 'children' of
>the X server. The above is a temporary work-around for
>a non-kernel problem.

But a problem the kernel really should be capable of handling
transparently.
>
>Cheers,
>Dick Johnson
>Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
> Notice : All mail here is now cached for review by John Ashcroft.

What on earth for? I don't issue anything he would be interested in
except the first part of my sig. And that's been in my sig for a year
or so, and will stay there till the so-called Patriot Act is
repealed. John Ashcroft has done more damage to democracy
single-handedly because of his paranoia than any other 20 men in our
history. G. Washington certainly wouldn't have tolerated such a
person in his 1st term of government.

Depending on the mailing list, data here has a lifetime as short as 30
days.

Sorry about spilling politics into the list folks.

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
99.28% setiathome rank, not too shabby for a WV hillbilly
Yahoo.com attorneys please note, additions to this message
by Gene Heskett are:
Copyright 2004 by Maurice Eugene Heskett, all rights reserved.

2004-11-03 19:25:24

by gene heskett

Subject: Re: is killing zombies possible w/o a reboot?

On Wednesday 03 November 2004 14:03, Måns Rullgård wrote:
>Gene Heskett <[email protected]> writes:
>> On Wednesday 03 November 2004 12:44, DervishD wrote:
>>> Hi Gene :)
>>>
>>> * Gene Heskett <[email protected]> dixit:
>>>> > Or write a little program that just 'wait()'s for the
>>>> > specified PID's. That is perfectly portable IMHO. But I must
>>>> > admit that the preferred way should be killing the parent.
>>>> > 'init' will reap the children after that.
>>>>
>>>> But what if there is no parent, since the system has already
>>>> disposed of it?
>>>
>>> Then the children are reparented to 'init' and 'init' gets rid
>>> of them. That's the way UNIX behaves.
>>
>> Unforch, I've *never* had it work that way. Any dead process I've
>> ever had while running linux has only been disposable by a reboot.
>
>That's because its parent was still sitting around refusing to
> wait() for them.

Define 'parent' when it was a click on the app's icon on the xwindow
screen that started it, please.

--
Cheers, gene
gheskett at wdtv dot com
99.28% setiathome rank, not too bad for a WV hillbilly

2004-11-03 19:29:22

by DervishD

Subject: Re: is killing zombies possible w/o a reboot?

Hi Gene :)

* Gene Heskett <[email protected]> dixit:
> > Then the children are reparented to 'init' and 'init' gets rid
> > of them. That's the way UNIX behaves.
> Unforch, I've *never* had it work that way. Any dead process I've
> ever had while running linux has only been disposable by a reboot.

Well, you know, shit happens... Anyway, could you define 'dead'?
Because if you're talking about zombies whose parent dies, they're
easy to get rid of: just wait until init reaps them (usually in less
than 5 minutes after they die). If you are talking about zombies
whose parent is still alive, then it's a bug in the application, not
the kernel. In fact I wouldn't like it if the kernel reaped my
children before I do, just in case I want to do something.

If you're talking about unkillable processes (those stuck in
disk-sleep state), you're right: only rebooting can kill them
(although sometimes they leave D state and die normally). Bad
luck for you if every dead process you've ever had while running Linux
has been of this kind :(

Raúl Núñez de Arenas Coronado

--
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/

2004-11-03 19:29:22

by gene heskett

Subject: Re: is killing zombies possible w/o a reboot?

On Wednesday 03 November 2004 14:06, [email protected] wrote:
>On Wed, 03 Nov 2004 13:53:39 EST, Gene Heskett said:
>> On Wednesday 03 November 2004 12:44, DervishD wrote:
>> > Then the children are reparented to 'init' and 'init' gets
>> > rid of them. That's the way UNIX behaves.
>>
>> Unforch, I've *never* had it work that way. Any dead process I've
>> ever had while running linux has only been disposable by a reboot.
>
>The problem likely isn't the true "zombie" - the only thing that
> *those* processes have left is a process table entry to save the
> exit code for a wait() syscall that might not happen anytime soon.
> And unless you have hundreds of them sitting around causing
> pressure on the 32K process limit, they're probably not a big
> problem.
>
>More likely, what you're looking at is some process that has gone
> down into the kernel on some syscall or other and gotten blocked.
> Since signals aren't delivered until it returns, it ends up
> "unkillable".
>
>Traditionally, a common cause for such wedging was a lost/misplaced
> interrupt from an I/O operation, so a read()/write()/ioctl() call
> wouldn't return because the device hadn't reported it completed.
> (tape drives were notorious for this). Often, power-cycling the I/O
> device would cause an unsolicited interrupt to be generated, which
> would clear the "waiting for interrupt" issue and allow the process
> to return....

Well, since the "device", a bt878-based Hauppauge TV card, is sitting in
a PCI socket, that's even more drastic than a reboot.

--
Cheers, gene
gheskett at wdtv dot com
99.28% setiathome rank, not too bad for a WV hillbilly

2004-11-03 19:35:13

by Douglas McNaught

Subject: Re: is killing zombies possible w/o a reboot?

Gene Heskett <[email protected]> writes:

> On Wednesday 03 November 2004 14:03, Måns Rullgård wrote:

>>
>>That's because its parent was still sitting around refusing to
>> wait() for them.
>
> Define 'parent' when it was a click on the app's icon on the xwindow
> screen that started it, please.

Whichever process called fork() to create the app process is the
parent. Sounds like it's some component of the desktop environment.

-Doug

2004-11-03 19:38:07

by Valdis Klētnieks

Subject: Re: is killing zombies possible w/o a reboot?

On Wed, 03 Nov 2004 14:26:23 EST, Gene Heskett said:

> Well, since the "device", a bt878-based Hauppauge TV card, is sitting in
> a PCI socket, that's even more drastic than a reboot.

Not if you have a good hot-swap PCI cage. ;)

Anyhow, that points even more at a driver issue for the bt878 -
if you can get Sysrq-T output, where does it say the hung process is
inside the kernel?



2004-11-03 19:38:07

by Måns Rullgård

Subject: Re: is killing zombies possible w/o a reboot?

Gene Heskett <[email protected]> writes:

> On Wednesday 03 November 2004 14:03, Måns Rullgård wrote:
>>Gene Heskett <[email protected]> writes:
>>> On Wednesday 03 November 2004 12:44, DervishD wrote:
>>>> Hi Gene :)
>>>>
>>>> * Gene Heskett <[email protected]> dixit:
>>>>> > Or write a little program that just 'wait()'s for the
>>>>> > specified PID's. That is perfectly portable IMHO. But I must
>>>>> > admit that the preferred way should be killing the parent.
>>>>> > 'init' will reap the children after that.
>>>>>
>>>>> But what if there is no parent, since the system has already
>>>>> disposed of it?
>>>>
>>>> Then the children are reparented to 'init' and 'init' gets rid
>>>> of them. That's the way UNIX behaves.
>>>
>>> Unforch, I've *never* had it work that way. Any dead process I've
>>> ever had while running linux has only been disposable by a reboot.
>>
>>That's because its parent was still sitting around refusing to
>> wait() for them.
>
> Define 'parent' when it was a click on the app's icon on the xwindow
> screen that started it, please.

Run "ps axf".
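
The same parent information that "ps axf" draws as a tree can be read
straight from /proc. A small Python sketch, assuming Linux with /proc
mounted (ancestry is a hypothetical helper name), walks from a PID up
through its ancestors:

```python
import os


def ancestry(pid):
    """Return [(pid, command), ...] from pid up through its ancestors."""
    chain = []
    while pid > 0:
        try:
            with open(f"/proc/{pid}/stat", errors="replace") as f:
                stat = f.read()
        except OSError:
            break  # process vanished or is not visible to us
        # Format: "pid (comm) state ppid ..."; comm may contain spaces.
        comm = stat[stat.index("(") + 1:stat.rindex(")")]
        ppid = int(stat.rsplit(")", 1)[1].split()[1])
        chain.append((pid, comm))
        pid = ppid
    return chain


for pid, comm in ancestry(os.getpid()):
    print(pid, comm)
```

Passing the PID of the stuck process instead of os.getpid() shows
which process would have to call wait() for it.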

--
Måns Rullgård
[email protected]

2004-11-03 19:56:18

by DervishD

Subject: Re: is killing zombies possible w/o a reboot?

Hi Gene :)

* Gene Heskett <[email protected]> dixit:
> >Traditionally, a common cause for such wedging was a lost/misplaced
> > interrupt from an I/O operation, so a read()/write()/ioctl() call
> > wouldn't return because the device hadn't reported it completed.
> > (tape drives were notorious for this). Often, power-cycling the I/O
> > device would cause an unsolicited interrupt to be generated, which
> > would clear the "waiting for interrupt" issue and allow the process
> > to return....
> Well, since the "device", a bt878-based Hauppauge TV card, is sitting in
> a PCI socket, that's even more drastic than a reboot.

Do you mean your Hauppauge got stuck in disk-sleep state? Wow,
that sounds *weird*...

I think that the parent (which is whatever process did the fork
when you clicked your mouse) is still alive and forgetting to do the
'wait()' for its children.

Raúl Núñez de Arenas Coronado

--
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/

2004-11-03 20:00:51

by Måns Rullgård

Subject: Re: is killing zombies possible w/o a reboot?

linux-os <[email protected]> writes:

> The fix is to fix the code. Your temporary fix is to use
> Ctrl-Alt-backspace to kill the X11 server (the parent).

The X server is not the parent. The desktop manager (or whatever
those beasts are called) is more likely to be.

> All these little windows and icons are the 'children' of the X
> server.

The X server manages a set of windows, arranged in a logical tree
structure, with all windows ultimately descending from the root
windows. The parent-child relationships between windows should under
no circumstance be confused, or compared, with that between processes.
Any process, on any machine on the network, can, given enough
privileges, create subwindows of any window on the X server. Windows
and processes belong to different worlds, the only connection between
which is that processes create windows, simply since anything that
happens in the computer is done by a process (or interrupt handler).

Am I really reading this on linux-kernel?

--
Måns Rullgård
[email protected]

2004-11-03 20:07:38

by Helge Hafting

Subject: Re: is killing zombies possible w/o a reboot?

On Wed, Nov 03, 2004 at 11:24:19AM -0500, Gene Heskett wrote:
> On Wednesday 03 November 2004 09:33, bert hubert wrote:
> >On Wed, Nov 03, 2004 at 07:51:39AM -0500, Gene Heskett wrote:
> >> But I'd tried to run gnomeradio earlier to listen to the
> >> elections,
> >
> >Depressing enough.
> >
> >> I'd tried to kill the zombie earlier but couldn't.
> >> Isn't there some way to clean up a &^$#^#@)_ zombie?
> >
> >Kill the parent, is the only (portable) way.
>
> The parent would have been the icon. It opened its usual sized small
> window, but never did anything to it. I clicked on closing the
> window, but 10 seconds later the system asked me if I wanted to kill
> it as it wasn't responding. I said yes, the window disappeared, but
> kpm said gnomeradio was still present as process 8162, and that wasn't
> killable. Funny thing is, on the reboot, it automatically self
> restored and ran just fine.
>
> I consider this as one of linux's achilles heels. Such a hung and
> dead process can be properly disposed of by a primitive os called os9
> because it keeps track of all resources in tables in the kernel
> memory space. Issuing a kill procnumber removes the process from
> the exec queue, reclaims all its memory to the system free memory
> pool, and removes it from the IRQ service tables if an entry exists
> there. Near instant, total cleanup, nothing left, in about 250
> microseconds max. 1.79 mhz cpu's aren't quite instant :)
>
Killing a process in linux with "kill -9 pid" also releases all resources,
such as memory and file descriptors. The resource consumption of a
"zombie" is measured in bytes, not kilobytes.

> Lets just say that I think having to reboot because of a zombie that
> has resources locked up, and have the reboot fubared by it too,
> aren't exactly friendly actions.
>
Did you try logging out from the graphical user interface,
and then logging in again?
GUI programs are usually children of the window manager (or some
app launcher); all of these quit when you log out. A plain
zombie started from the GUI will disappear after that.

Only something stuck in a device driver will need the reboot,
but that tends to be a bug in the driver.
You can try unloading the driver module, but linux has a
nasty tendency to answer that with an OOPS or worse. When
something goes wrong - it does so properly and thoroughly. :-)

> I fully realise that linux has a much more complex method of
> allocating resources, but doesn't it *know* exactly what resources
> have been passed out to each process?
>
Yes it does - the problem is that not all resources are managed
by processes. Some allocations are managed by drivers, so a driver
bug can get the device into an unusable state _and_ tie up the
process(es) that were using the driver at the moment.

> And why is there no entry from the kill function into that resource
> management portion of the kernel so that this could also be done by
> the linux kernel, say with a "kill --total procnumber"?
>
Interesting, but you might need a path from "kill" into
every device driver. :-/ And of course it still won't work
if there is a bug in the driver.

> Seems like a heck of a good question to me since an os written to run
> on a 64k machine in 1981, and expanded to run on a 128K to 2 megabyte
> machine in 1986 can do it just fine. Even if that process is still
> running and spitting out data to its parent window/shell! Or if its
> crashed and scribbled over all its memory, makes no difference to
> os9. You (root) wants it gone, fine, its gone.
>
Can os9 do this if the process is busy calling into a buggy
device driver that simply doesn't return or perhaps believes
that some dma operation into process memory is taking forever?
Or perhaps os9 doesn't have lots and lots of drivers written by
different people with varying competence?

Often, the real solution is to fix the driver to deal with
"unexpected" conditions.

Helge Hafting

2004-11-03 20:12:43

by gene heskett

Subject: Re: is killing zombies possible w/o a reboot?

On Wednesday 03 November 2004 14:33, [email protected] wrote:
>On Wed, 03 Nov 2004 14:26:23 EST, Gene Heskett said:
>> Well, since the "device", a bt878 based Hauppauge tv card is
>> sitting in a pci socket, thats even more drastic than a reboot.
>
>Not if you have a good hot-swap PCI cage. ;)
>
>Anyhow, that points even more at a driver issue for the bt878 -
>if you can get Sysrq-T output, where does it say the hung process is
>inside the kernel?

Thats another thing I've had compiled in since forever, but it so
seldom actually *works*, I've tended to forget about it.

--
Cheers, gene
gheskett at wdtv dot com
99.28% setiathome rank, not too bad for a WV hillbilly

2004-11-03 20:19:01

by Gene Heskett

Subject: Re: is killing zombies possible w/o a reboot?

On Wednesday 03 November 2004 14:26, DervishD wrote:
> Hi Gene :)
>
> * Gene Heskett <[email protected]> dixit:
>> > Then the children are reparented to 'init' and 'init' gets
>> > rid of them. That's the way UNIX behaves.
>>
>> Unforch, I've *never* had it work that way. Any dead process I've
>> ever had while running linux has only been disposable by a reboot.
>
> Well, you know, shit happens... Anyway, could you define 'dead'?
>Because if you're talking about zombies whose parent dies, they're
>killable easily: just wait until init reaps them (usually in less
>than 5 minutes since they dead). If you are talking about zombies
> who has their parent alive, then it's a bug in the application, not
> the kernel. In fact I wouldn't like if the kernel reaps my children
> before I do, just in case I want to do something.
>
> If you're talking about unkillable processes (those stuck in
>disk-sleep state), you're right: only rebooting can kill them
>(although sometimes they go out of D state and die normally). Bad
>luck for you if any dead process you've ever had while running linux
>has been of this kind :(
>
> Raúl Núñez de Arenas Coronado

That seems to be the only kind of dead processes I get, and thats not
too often. Booted to 2.6.10-rc1-bk11 now, its all working just fine
except for on messydos patch that finally must have made it into the
tree.

As it appears I do not have a prayer of convincing folks otherwise
about this issue, I suggest we let this thread die a well deserved
death till it bites me or someone else again. I'll summarize that
os9/nitros9 handles this situation effortlessly and flawlessly, and I
expected a 150x more sophisticated os to do likewise. My mistake.
OTOH, its one hell of a versatile os IMNSHO. I'm not going away just
because it bites me occasionally.

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
99.28% setiathome rank, not too shabby for a WV hillbilly
Yahoo.com attorneys please note, additions to this message
by Gene Heskett are:
Copyright 2004 by Maurice Eugene Heskett, all rights reserved.

2004-11-03 20:40:55

by Gene Heskett

Subject: Re: is killing zombies possible w/o a reboot?

On Wednesday 03 November 2004 15:13, Helge Hafting wrote:
>On Wed, Nov 03, 2004 at 11:24:19AM -0500, Gene Heskett wrote:
[...]
>> Lets just say that I think having to reboot because of a zombie
>> that has resources locked up, and have the reboot fubared by it
>> too, aren't exactly friendly actions.
>
>Did you try logging out from the graphical user interface,
>and then logging in again?

It took around 2 minutes for the logout of X to get back to a VC.
So obviously something slowed it down as thats a 4 second operation
here normally. And it didn't surprise me when the "reboot" shutdown
hung on "stopping alsasound" and I had to use the reset button.
[...]
>> I fully realise that linux has a much more complex method of
>> allocating resources, but doesn't it *know* exactly what resources
>> have been passed out to each process?
>
>Yes it does - the problem is that not all resources are managed
>by processes. Some allocations are managed by drivers, so a driver
>bug can get the device into a unuseable state _and_ tie up the
>process(es) that were using the driver at the moment.

This from my viewpoint, is wrong. The kernel, and only the kernel
should be ultimately responsible for handing out resources, and
reclaiming at its convenience.

>> And why is there no entry from the kill function into that
>> resource management portion of the kernel so that this could also
>> be done by the linux kernel, say with a "kill --total procnumber"?
>
>Interesting, but you might need a path from "kill" into
>every device driver. :-/ And of course it still won't work
>if there is a bug in the driver.

Thats the fault of the design IMO.

>> Seems like a heck of a good question to me since an os written to
>> run on a 64k machine in 1981, and expanded to run on a 128K to 2
>> megabyte machine in 1986 can do it just fine. Even if that
>> process is still running and spitting out data to its parent
>> window/shell! Or if its crashed and scribbled over all its
>> memory, makes no difference to os9. You (root) wants it gone,
>> fine, its gone.
>
>Can os9 do this if the process is busy calling into a buggy
>device driver that simply doesn't return or perhaps believes
>that some dma operation into process memory is taking forever?
>Or perhaps os9 doesn't have lots and lots of drivers written by
>different people with varying competence?

It did have quite a few authors involved in it over the years
including me, I did many of its utilities, and converted the rbf.mn
from 6809 code to 6309 code, roughly doubling its speed without
fiddling with the clock speed, which is married to the video on that
machine. I also did a couple of its clock modules, which are the
heart of the multitasking it does. And yes, it could kill,
absolutely cleanly, any process you named on the command line at any
time. Any drivers involved got their scratch space from the callers
loading of a set of pointers, so if a driver was being accessed by 2
or more processes, each instance had its own stack/process space.
When the process disappeared, the recovery included that space in
memory. The driver proper had no long term history of that process's
actions, even if a disk seek microsleep or similar was in progress
when the caller disappeared.

>Often, the real solution is to fix the driver to deal with
>"unexpected" conditions.
>
>Helge Hafting

As I said earlier, lets let this horse be buried, "its dead Jim", and
my beating on it is only wasting bandwidth.

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
99.28% setiathome rank, not too shabby for a WV hillbilly
Yahoo.com attorneys please note, additions to this message
by Gene Heskett are:
Copyright 2004 by Maurice Eugene Heskett, all rights reserved.

2004-11-03 20:55:05

by Tom Felker

Subject: Re: is killing zombies possible w/o a reboot?

On Wednesday 03 November 2004 06:51 am, Gene Heskett wrote:
> Greetings;
>
> I thought I'd get caught up on -bkx kernels and made a -bk8 just now.
>
> But I'd tried to run gnomeradio earlier to listen to the elections,
> but it failed leaving to run, as did tvtime then too, claiming it
> couldn't get a lock on /dev/video0, and gnomeradio apparently left a
> lock on alsasound that prevented the normal gracefull shutdown by
> locking up the shutdown on the "stopping alsasound" line. So I had
> to use the hardware reset.
>
> I'd tried to kill the zombie earlier but couldn't.
>
> Isn't there some way to clean up a &^$#^#@)_ zombie?

Ok, let me try to explain what probably happened.

First, terminology. When one process wants to become two processes, it
fork()s. One process is the parent, and one is the child. The child usually
exec()s to become a different program. The parent sometimes wants to know
when the child ends and whether it succeeded. Thus, the wait() system calls.
The parent can either check whether a child died, or go to sleep until one
does. When the parent is awakened, it's told which child died and what the
child's exit status was (usually 0 for success). But if the child dies
before the parent wait()s, the kernel must keep a record of which child died
and what its exit status was, and it can't reassign the late child's PID yet.
This record is a "zombie," and shows up under top or ps with the 'Z' state.
Zombies do _not_ hold open files, memory, or resources of any kind.
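
The zombie lifecycle described above can be reproduced with a small
sketch. It is written in Python and assumes a Linux system with /proc
mounted; the helper names proc_state and make_and_reap_zombie are made
up for this illustration. The child exits at once, the parent watches
it sit in the 'Z' state, and a single waitpid() call makes it vanish.

```python
import os
import time

def proc_state(pid):
    """Return the one-letter state of pid as reported by /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so take the first field after the LAST ')'.
    return data.rsplit(")", 1)[1].split()[0]

def make_and_reap_zombie(exit_code=7, timeout=5.0):
    """Create a zombie child, observe it, then reap it.

    Returns (state seen before reaping, exit code, /proc entry gone?).
    """
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)          # child: die immediately

    # Parent: deliberately do not wait() yet; poll until the kernel
    # shows the child as a zombie.
    deadline = time.monotonic() + timeout
    state = proc_state(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = proc_state(pid)

    # Reaping is all it takes to get rid of the zombie.
    _, status = os.waitpid(pid, 0)
    gone = not os.path.exists(f"/proc/{pid}")
    return state, os.WEXITSTATUS(status), gone

if __name__ == "__main__":
    print(make_and_reap_zombie())
```

No signal is involved at any point: the zombie goes away because its
parent collected the exit status, which is why kill -9 on the zombie
itself does nothing.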

That's the technical definition of a zombie, which I'm telling you because
that's probably not your situation: I assume you used "zombie" as an
informal term for a process that you can't kill. Your problem is a process
in uninterruptible sleep (the "D" state).

When a process executing in userspace wants information from a device, like a
disk or TV capture card, it calls read(), and context switches into kernel
space. Usually, it will take a moment for the data to be available from the
device, so the process gets put on a wait queue so other processes can run.
Obviously nothing is deallocated, because everyone expects the process will
get its data and proceed as normal. When the device has the data, it
interrupts the CPU, and the kernel figures out who wanted the data and puts
them on the run queue.

When a process is on a wait queue waiting for data from a device (the D
state), it's impossible to kill. This is because otherwise, when the
interrupt did come, the structures associated with the process would have
been freed, and the kernel would crash. It would require an incredible
amount of inefficient bookkeeping to avoid this, and it's unnecessary
because normally, the data request will finish (successfully or not), and the
process will be woken up, or if it was sent SIGKILL, it will be killed.
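
To tell the two cases apart on a live system, one can read the state
letter and parent PID of every process from /proc. The following is
only a sketch, assuming Linux and the usual /proc/<pid>/stat layout
("pid (comm) state ppid ..."); parse_stat and find_by_state are
hypothetical helper names, not part of any existing tool.

```python
import os

def parse_stat(line):
    """Split one /proc/<pid>/stat line into (pid, comm, state, ppid)."""
    # comm is enclosed in parentheses and may contain ')' itself,
    # so split on the last ')' in the line.
    head, tail = line.rsplit(")", 1)
    pid, comm = head.split(" (", 1)
    fields = tail.split()
    return int(pid), comm, fields[0], int(fields[1])

def find_by_state(states=("D", "Z")):
    """List (pid, comm, state, ppid) for processes in the given states."""
    found = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                info = parse_stat(f.read())
        except OSError:
            continue                 # process went away while scanning
        if info[2] in states:
            found.append(info)
    return found

if __name__ == "__main__":
    for pid, comm, state, ppid in find_by_state():
        print(f"{pid:>7} {state} ppid={ppid:<7} {comm}")
```

A 'Z' entry points at a parent that has not called wait(); a 'D' entry
that never changes points at a driver or device, as explained above.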

Long story short, what happened was, some faulty hardware or some buggy
driver, probably associated with the capture card, had a problem and left the
process in D state. Thus, it couldn't be killed, and since it had /dev/video
open, tvtime couldn't run and failed gracefully, and because it held /dev/dsp
open, and couldn't be killed as the init scripts would normally do in that
situation, the audio drivers couldn't be unloaded and the shutdown hung.

So give us a bunch of information about what hardware you're using, output of
dmesg, and steps to reproduce the driver bug (if it is that).
--
Tom Felker, <[email protected]>
<http://vlevel.sourceforge.net> - Stop fiddling with the volume knob.

If you have to design something and control freaks are involved, give them
plenty of knobs, but don't connect them to anything important.

2004-11-03 21:14:47

by Gene Heskett

Subject: Re: is killing zombies possible w/o a reboot?

On Wednesday 03 November 2004 15:48, Tom Felker wrote:
>On Wednesday 03 November 2004 06:51 am, Gene Heskett wrote:
>> Greetings;
>>
>> I thought I'd get caught up on -bkx kernels and made a -bk8 just
>> now.
>>
>> But I'd tried to run gnomeradio earlier to listen to the
>> elections, but it failed leaving to run, as did tvtime then too,
>> claiming it couldn't get a lock on /dev/video0, and gnomeradio
>> apparently left a lock on alsasound that prevented the normal
>> gracefull shutdown by locking up the shutdown on the "stopping
>> alsasound" line. So I had to use the hardware reset.
>>
>> I'd tried to kill the zombie earlier but couldn't.
>>
>> Isn't there some way to clean up a &^$#^#@)_ zombie?
>
>Ok, let me try to explain what probably happened.
>
>First, terminology. When one process wants to become two
> processes, it fork()s. One process is the parent, and one is the
> child. The child usually exec()s to become a different program.
> The parent sometimes wants to know when the child ends and whether
> it succeeded. Thus, the wait() system calls. The parent can either
> check whether a child died, or go to sleep until one does. When
> the parent is awakened, it's told which child died and what the
> child's exit status was (usually 0 for success). But if the child
> dies before the parent wait()s, the kernel must keep a record of
> which child died and what its exit status was, and it can't
> reassign the late child's PID yet. This record is a "zombie," and
> shows up under top or ps with the 'Z' state. Zombies do _not_ hold
> open files, memory, or resources of any kind.
>
>That's the technical definition of a zombie, which I'm telling you
> because that's probably not your situation: I assume you used
> "zombie" as an informal term for a process that you can't kill.
> Your problem is a process in uninterruptible sleep (the "D" state).
>
>When a process executing in userspace wants information from a
> device, like a disk or TV capture card, it calls read(), and
> context switches into kernel space. Usually, it will take a moment
> for the data to be available from the device, so the process gets
> put on a wait queue so other processes can run. Obviously nothing
> is deallocated, because everyone expects the process will get its
> data and proceed as normal. When the device has the data, it
> interrupts the CPU, and the kernel figures out who wanted the data
> and puts them on the run queue.
>
>When a process is on a wait queue waiting for data from a device
> (the D state), it's impossible to kill. This is because otherwise,
> when the interrupt did come, the structures associated with the
> process would have been freed, and the kernel would crash. It
> would require an incredible amount of inefficient bookkeeping to
> avoid this, and it's unnecessary because normally, the data request
> will finish (successfully or not), and the process will be woken
> up, or if it was sent SIGKILL, it will be killed.
>
>Long story short, what happened was, some faulty hardware or some
> buggy driver, probably associated with the capture card, had a
> problem and left the process in D state. Thus, it couldn't be
> killed, and since it had /dev/video open, tvtime couldn't run and
> failed gracefully, and because it held /dev/dsp open, and couldn't
> be killed as the init scripts would normally do in that situation,
> the audio drivers couldn't be unloaded and the boot process hung.
>
>So give us a bunch of information about what hardware you're using,
> output of dmesg, and steps to reproduce the driver bug (if it is
> that).

Its a dead horse Tom, lets bury it. I've rebooted to 4 new kernels
since that time as I march toward getting caught up with whatever
bk(nn) is out today. Other than that, which took place on bk7's
watch, its all working rather well.

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
99.28% setiathome rank, not too shabby for a WV hillbilly
Yahoo.com attorneys please note, additions to this message
by Gene Heskett are:
Copyright 2004 by Maurice Eugene Heskett, all rights reserved.

2004-11-03 22:28:13

by Jim Nelson

Subject: Re: is killing zombies possible w/o a reboot?

DervishD wrote:
> Hi Gene :)
>
> * Gene Heskett <[email protected]> dixit:
>
>>> Then the children are reparented to 'init' and 'init' gets rid
>>>of them. That's the way UNIX behaves.
>>
>>Unforch, I've *never* had it work that way. Any dead process I've
>>ever had while running linux has only been disposable by a reboot.
>
>
> Well, you know, shit happens... Anyway, could you define 'dead'?
> Because if you're talking about zombies whose parent dies, they're
> killable easily: just wait until init reaps them (usually in less
> than 5 minutes since they dead). If you are talking about zombies who
> has their parent alive, then it's a bug in the application, not the
> kernel. In fact I wouldn't like if the kernel reaps my children
> before I do, just in case I want to do something.
>
> If you're talking about unkillable processes (those stuck in
> disk-sleep state), you're right: only rebooting can kill them
> (although sometimes they go out of D state and die normally). Bad
> luck for you if any dead process you've ever had while running linux
> has been of this kind :(
>

I did this to myself a number of times when I was first learning Samba - even an
ls would become unkillable. You couldn't rmmod smb, since it was in use, and you
couldn't kill the process, since it was waiting on a syscall. Ergh.

> Raúl Núñez de Arenas Coronado
>

2004-11-03 22:48:50

by Russell Miller

Subject: Re: is killing zombies possible w/o a reboot?

On Wednesday 03 November 2004 16:15, Jim Nelson wrote:

> I did this to myself a number of times when I was first learning Samba -
> even an ls would become unkillable. You couldn't rmmod smb, since it was
> in use, and you couldn't kill the process, since it was waiting on a
> syscall. Ergh.
>

I'm not going to pretend to be a kernel expert, or really anything other than
a newbie when it comes to kernel internals, so please take this with the
merits it deserves - many, or none, depending.

Anyway, is there a way to simply signal a syscall that it is to be interrupted
and forcibly cause the syscall to end? Kicking the program execution out of
kernel space would be sufficient to "unstick" the process - and coupling that
with an automatic KILL signal may not be a bad idea.

I'm pretty sure that someone will think of a reason why this wouldn't work with
very little effort. Please enlighten me?

--Russell

--

Russell Miller - [email protected] - Le Mars, IA
Duskglow Consulting - Helping companies just like you to succeed for ~ 10 yrs.
http://www.duskglow.com - 712-546-5886

2004-11-03 23:01:49

by Bill Davidsen

Subject: Re: is killing zombies possible w/o a reboot?

DervishD wrote:
> Hi all :)
>
> * Måns Rullgård <[email protected]> dixit:
>
>>>>I'd tried to kill the zombie earlier but couldn't.
>>>>Isn't there some way to clean up a &^$#^#@)_ zombie?
>>>
>>>Kill the parent, is the only (portable) way.
>>
>>Perhaps not as portable, but another possible, though slightly
>>complicated, way is to ptrace the parent and force it to wait().
>
>
> Or write a little program that just 'wait()'s for the specified
> PID's. That is perfectly portable IMHO. But I must admit that the
> preferred way should be killing the parent. 'init' will reap the
> children after that.

You can't wait() for the process, you have to use waitfor(), and the
last time I tried that it didn't work, although I don't remember the
symptom beyond that.
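
One plausible reason a stand-alone reaper program fails is that the
wait family only works on the caller's own children. A minimal Python
sketch of that rule follows (try_wait and demo are names invented
here; PID 1 is used only as an example of a process that is never our
child):

```python
import os

def try_wait(pid):
    """Wait for pid. Return its exit code, or None if the kernel
    refuses with ECHILD because pid is not a child of the caller."""
    try:
        _, status = os.waitpid(pid, 0)
    except ChildProcessError:        # errno ECHILD
        return None
    return os.WEXITSTATUS(status)

def demo():
    """Return (result for our own child, result for PID 1)."""
    child = os.fork()
    if child == 0:
        os._exit(3)                  # child: exit with a known code
    own = try_wait(child)            # allowed: we are the parent
    foreign = try_wait(1)            # refused: not our child
    return own, foreign

if __name__ == "__main__":
    print(demo())
```

So an unrelated helper cannot reap somebody else's zombie; either the
real parent calls wait(), or the parent dies and the new parent does.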

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2004-11-03 23:11:12

by Vadim Lobanov

Subject: Re: is killing zombies possible w/o a reboot?

Also a kernel newbie here, so apply appropriate amount of salt to
response. :)

One common scenario for why a program is blocked within a syscall is
that it is waiting for data to arrive. Consider, for example, a read()
on a file -- simplifying a lot, the data has to be fetched from disk,
which is slow. So, while the disk is doing its thing, the program is
blocked within the system call. Then, when an interrupt arrives
signalling that the data is ready, it is placed into the user-space
buffer, and the program is kicked out of the syscall so that it can
continue executing.

Consider what happens if the program suddenly dies within the read()
syscall above: when the data from disk comes back, the kernel needs to
figure out where to put it. This would make for a very confused kernel,
since the original requester "vanished" without a trace. Even worse,
another program might have taken the original program's place in the
meantime! Very bad things happen.

This is certainly not an _impossible_ problem to solve (as far as I
know), but solving it in the general case would involve a lot of
expensive and complex book-keeping code, so it's simply not done.

Am I right? Wrong? Please enlighten me as well. :)

-Vadim Lobanov

On Wed, 3 Nov 2004, Russell Miller wrote:

> On Wednesday 03 November 2004 16:15, Jim Nelson wrote:
>
> > I did this to myself a number of times when I was first learning Samba -
> > even an ls would become unkillable. You couldn't rmmod smb, since it was
> > in use, and you couldn't kill the process, since it was waiting on a
> > syscall. Ergh.
> >
>
> I'm not going to pretend to be a kernel expert, or really anything other than
> a newbie when it comes to kernel internals, so please take this with the
> merits it deserves - many, or none, depending.
>
> Anyway, is there a way to simply signal a syscall that it is to be interrupted
> and forcibly cause the syscall to end? Kicking the program execution out of
> kernel space would be sufficient to "unstick" the process - and coupling that
> with an automatic KILL signal may not be a bad idea.
>
> I'm pretty sure that someone will think of a way why this wouldn't work with
> very little effort. Please enlighten me?
>
> --Russell
>
> --
>
> Russell Miller - [email protected] - Le Mars, IA
> Duskglow Consulting - Helping companies just like you to succeed for ~ 10 yrs.
> http://www.duskglow.com - 712-546-5886

2004-11-03 23:14:57

by Douglas McNaught

Subject: Re: is killing zombies possible w/o a reboot?

Russell Miller <[email protected]> writes:

> Anyway, is there a way to simply signal a syscall that it is to be
> interrupted and forcibly cause the syscall to end? Kicking the
> program execution out of kernel space would be sufficient to
> "unstick" the process - and coupling that with an automatic KILL
> signal may not be a bad idea.

It was already mentioned in this thread that the bookkeeping required
to clean up properly from such an abort would add a lot of overhead
and slow down the normal, non-buggy case.

-Doug

2004-11-03 23:35:15

by Russell Miller

Subject: Re: is killing zombies possible w/o a reboot?

On Wednesday 03 November 2004 17:03, Doug McNaught wrote:

> It was already mentioned in this thread that the bookkeeping required
> to clean up properly from such an abort would add a lot of overhead
> and slow down the normal, non-buggy case.
>
I am going to continue pursuing this at the risk of making a bigger fool of
myself than I already am, but I want to make sure that I understand the
issues - and I did read the message you are referring to.

I think what you are saying is that there is kind of a race condition here.
When something is on the wait queue, it has to be followed through to
completion. An interrupt could be received at any time, and if it's taken
off of the wait queue prematurely, it'll crash the kernel, because the
interrupt has no way of telling that.

That's fine as it goes, I understand that. But I submit that this is a
horrible design. I've been bitten by this more than once - usually regarding
broken NFS connections.

But what I don't understand is why the bookkeeping would be so inefficient.
It seems to me that all that would be required is a bitfield of some sort.
If that position in the wait queue becomes invalid, when the interrupt is
received to process it, the kernel notes that a flag is set invalidating that
part of the wait queue, dumps the output to /dev/null, and goes on to the
next. This doesn't seem inefficient to me, unless I'm missing something.
A little more inefficient, yes, but nowhere near the cost that seems to be
implied.
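
To make the flag idea concrete, here is a toy model of it in Python.
This is purely an illustration of the proposal, not how the Linux
kernel is structured: ToyWaitQueue, Request and their methods are
invented for the sketch, and it ignores the hard part raised elsewhere
in the thread (a device that may still be writing into the dead
process's memory, or a driver holding locks on its behalf).

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    owner: int                       # PID that issued the request
    buffer: list = field(default_factory=list)
    cancelled: bool = False          # the proposed "invalid" flag

class ToyWaitQueue:
    def __init__(self):
        self.pending = {}            # request id -> Request

    def submit(self, req_id, owner):
        self.pending[req_id] = Request(owner)

    def kill(self, owner):
        """Mark every outstanding request of owner as cancelled."""
        for req in self.pending.values():
            if req.owner == owner:
                req.cancelled = True

    def complete(self, req_id, data):
        """Called when the 'interrupt' arrives with data."""
        req = self.pending.pop(req_id)
        if req.cancelled:
            return None              # requester is gone: drop the data
        req.buffer.append(data)
        return req

if __name__ == "__main__":
    q = ToyWaitQueue()
    q.submit(1, owner=100)
    q.submit(2, owner=200)
    q.kill(100)
    print(q.complete(1, b"late data"))   # dropped
    print(q.complete(2, b"wanted data")) # delivered
```

The flag check itself is cheap; the cost the other posters refer to is
in everything the model leaves out, since the entry has to stay alive
and consistent until the completion finally arrives, if it ever does.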

And I also have to ask this question: what is more inefficient, slowing down
processing of output waiting on the queue, or having to reboot when a process
gets stuck due to faulty drivers? At the very least, a compile option seems
like it would be worthwhile for those that would like this behavior.

And I probably am. Missing something, that is.

--Russell

> -Doug

--

Russell Miller - [email protected] - Le Mars, IA
Duskglow Consulting - Helping companies just like you to succeed for ~ 10 yrs.
http://www.duskglow.com - 712-546-5886

2004-11-03 23:14:58

by Bill Davidsen

Subject: Re: is killing zombies possible w/o a reboot?

DervishD wrote:
> Hi Gene :)
>
> * Gene Heskett <[email protected]> dixit:
>
>>> Then the children are reparented to 'init' and 'init' gets rid
>>>of them. That's the way UNIX behaves.
>>
>>Unforch, I've *never* had it work that way. Any dead process I've
>>ever had while running linux has only been disposable by a reboot.
>
>
> Well, you know, shit happens... Anyway, could you define 'dead'?
> Because if you're talking about zombies whose parent dies, they're
> killable easily: just wait until init reaps them (usually in less
> than 5 minutes since they dead). If you are talking about zombies who
> has their parent alive, then it's a bug in the application, not the
> kernel. In fact I wouldn't like if the kernel reaps my children
> before I do, just in case I want to do something.
>
> If you're talking about unkillable processes (those stuck in
> disk-sleep state), you're right: only rebooting can kill them
> (although sometimes they go out of D state and die normally). Bad
> luck for you if any dead process you've ever had while running linux
> has been of this kind :(

That often seems to be the case, the kernel thinks there's an i/o going
on which isn't, and doesn't time it out. It would be nice if there were
a way to get the kernel to abort all outstanding i/o on kill -9, but I'm
sure if it were easy it would have happened. Timeouts in the application
are useful, but in some cases I believe the process dies because it
detects a long i/o time but has nothing to do but terminate, which
creates the zombie.

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2004-11-03 23:44:22

by Bill Davidsen

Subject: Re: is killing zombies possible w/o a reboot?

DervishD wrote:
> Hi Gene :)
>
> * Gene Heskett <[email protected]> dixit:
>
>>>Traditionally, a common cause for such wedging was a lost/misplaced
>>>interrupt from an I/O operation, so a read()/write()/ioctl() call
>>>wouldn't return because the device hadn't reported it completed.
>>>(tape drives were notorious for this). Often, power-cycling the I/O
>>>device would cause an unsolicited interrupt to be generated, which
>>>would clear the "waiting for interrupt" issue and allow the process
>>>to return....
>>
>>Well, since the "device", a bt878 based Hauppauge tv card is sitting in
>>a pci socket, thats even more drastic than a reboot.
>
>
> Do you mean your Hauppauge got stuck in disk-sleep state? Wow,
> that sounds *weird*...
>
> I think that the parent (which is whatever process did the fork
> when you clicked your mouse) is still alive and forgetting to do the
> 'wait()' for its children.

It would be good to know what the PPID is, from ps or similar. Things
from X are a pain, the parent is often something you don't want to kill.
Sometimes you can reparent from command line, "bash -c foo&" or similar,
so the parent can be killed without logging out.

I would swear that the parent *is* init in some cases, which is puzzling
since they should be reaped.
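
The reparenting itself is easy to observe. In the Python sketch below
(Linux assumed; orphan_new_parent is a name made up for the example) a
middle process forks a grandchild and exits immediately, and the
grandchild reports the parent PID it sees afterwards. That is
typically 1, though on newer systems it can be a "child subreaper"
process instead of init.

```python
import os
import time

def orphan_new_parent(timeout=5.0):
    """Orphan a grandchild and return (middle_pid, its new parent PID)."""
    r, w = os.pipe()
    middle = os.fork()
    if middle == 0:
        os.close(r)
        me = os.getpid()
        if os.fork() == 0:
            # Grandchild: wait until the kernel has reparented us,
            # then report the new parent PID through the pipe.
            try:
                deadline = time.monotonic() + timeout
                while os.getppid() == me and time.monotonic() < deadline:
                    time.sleep(0.01)
                os.write(w, str(os.getppid()).encode())
            finally:
                os._exit(0)
        os._exit(0)                  # middle exits, orphaning the grandchild

    os.close(w)
    os.waitpid(middle, 0)            # reap the middle process ourselves
    with os.fdopen(r) as pipe:
        new_ppid = int(pipe.read())  # EOF arrives once the grandchild exits
    return middle, new_ppid

if __name__ == "__main__":
    middle, new_ppid = orphan_new_parent()
    print(f"middle process was {middle}, grandchild now belongs to {new_ppid}")
```

Whichever process inherits the orphan is then responsible for the
wait(); a zombie whose PPID really is init and which still lingers
would be unusual.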

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2004-11-03 23:48:13

by Adam Heath

Subject: Re: is killing zombies possible w/o a reboot?

On Wed, 3 Nov 2004, DervishD wrote:

> Hi all :)
>
> * Måns Rullgård <[email protected]> dixit:
> > >> I'd tried to kill the zombie earlier but couldn't.
> > >> Isn't there some way to clean up a &^$#^#@)_ zombie?
> > > Kill the parent, is the only (portable) way.
> > Perhaps not as portable, but another possible, though slightly
> > complicated, way is to ptrace the parent and force it to wait().
>
> Or write a little program that just 'wait()'s for the specified
> PID's. That is perfectly portable IMHO. But I must admit that the
> preferred way should be killing the parent. 'init' will reap the
> children after that.

ptrace the parent, cause it to wait() for its children, then change IP, etc.

2004-11-03 23:52:04

by Mathieu

Subject: Re: is killing zombies possible w/o a reboot?

Russell Miller <[email protected]> wrote recently:

> I am going to continue pursuing this at the risk of making a bigger fool of
> myself than I already am, but I want to make sure that I understand the
> issues - and I did read the message you are referring to.
>
> I think what you are saying is that there is kind of a race condition here.
> When something is on the wait queue, it has to be followed through to
> completion. An interrupt could be received at any time, and if it's taken
> off of the wait queue prematurely, it'll crash the kernel, because the
> interrupt has no way of telling that.
>
> That's fine as it goes, I understand that. But I submit that this is a
> horrible design. I've been bitten by this more than once - usually regarding
> broken NFS connections.

This is because NFS-related syscalls are not interruptible by default.
You can make them interruptible by mounting your NFS shares with the
'intr' option.

--
I love people saying 'we' even though they never contributed a single
line of code to the project!

- Jens Axboe turning a troll down on linux-kernel

2004-11-03 23:59:43

by Russell Miller

Subject: Re: is killing zombies possible w/o a reboot?

On Wednesday 03 November 2004 17:47, Mathieu Segaud wrote:

> this is because nfs related syscalls are not interruptible by default.
> you can make them interruptible by mounting your nfs's with the 'intr'
> option.

That doesn't appear to work, then. Because we do mount them with the intr
option, and the behavior doesn't seem to be any different.

--Russell

--

Russell Miller - [email protected] - Le Mars, IA
Duskglow Consulting - Helping companies just like you to succeed for ~ 10 yrs.
http://www.duskglow.com - 712-546-5886

2004-11-04 00:12:18

by Mathieu

Subject: Re: is killing zombies possible w/o a reboot?

Russell Miller <[email protected]> wrote recently:

> On Wednesday 03 November 2004 17:47, Mathieu Segaud wrote:
>
>> this is because nfs related syscalls are not interruptible by default.
>> you can make them interruptible by mounting your nfs's with the 'intr'
>> option.
>
> That doesn't appear to work, then. Because we do mount them with the intr
> option, and the behavior doesn't seem to be any different.

Weird, it works here... I can even umount() lost shares.

NFS is quite an unknown beast to me, sorry.
But it is clearly a bug if you do mount them with -o intr.

--
<ajh> I always viewed HURD development like the Special Olympics of free software.

- Is Hurd a opponent to Linux?

2004-11-04 00:44:56

by Kurt Wall

Subject: Re: is killing zombies possible w/o a reboot?

On Wed, Nov 03, 2004 at 03:40:03PM -0500, Gene Heskett took 89 lines to write:
> On Wednesday 03 November 2004 15:13, Helge Hafting wrote:
> >
> >Yes it does - the problem is that not all resources are managed
> >by processes. Some allocations are managed by drivers, so a driver
> >bug can get the device into a unuseable state _and_ tie up the
> >process(es) that were using the driver at the moment.
>
> This from my viewpoint, is wrong. The kernel, and only the kernel
> should be ultimately responsible for handing out resources, and
> reclaiming at its convienience.

This might just be semantics, but device drivers are part of the kernel.

Kurt
--
In 1750 Issac Newton became discouraged when he fell up a flight of
stairs.

2004-11-04 01:01:57

by Russell Miller

Subject: Re: is killing zombies possible w/o a reboot?

On Wednesday 03 November 2004 18:43, Kurt Wall wrote:

> This might just be semantics, but device drivers are part of the kernel.
>
This brings up another question I've had since reading the documentation on
later pentium-class chips:

why are only rings 0 and 3 used in linux?

--Russell

> Kurt

2004-11-04 01:18:43

by Michael Clark

Subject: Re: is killing zombies possible w/o a reboot?

On 11/04/04 07:07, Bill Davidsen wrote:
> DervishD wrote:
>
>> Hi Gene :)
>>
>> * Gene Heskett <[email protected]> dixit:
>>
>>>> Then the children are reparented to 'init' and 'init' gets rid
>>>> of them. That's the way UNIX behaves.
>>>
>>>
>>> Unforch, I've *never* had it work that way. Any dead process I've
>>> ever had while running linux has only been disposable by a reboot.
>>
>>
>>
>> Well, you know, shit happens... Anyway, could you define 'dead'?
>> Because if you're talking about zombies whose parent dies, they're
>> killable easily: just wait until init reaps them (usually in less
>> than 5 minutes since they dead). If you are talking about zombies who
>> has their parent alive, then it's a bug in the application, not the
>> kernel. In fact I wouldn't like if the kernel reaps my children
>> before I do, just in case I want to do something.
>>
>> If you're talking about unkillable processes (those stuck in
>> disk-sleep state), you're right: only rebooting can kill them
>> (although sometimes they go out of D state and die normally). Bad
>> luck for you if any dead process you've ever had while running linux
>> has been of this kind :(
>
>
> That often seems to be the case, the kernel thinks there's an i/o going
> on which isn't, and doesn't time it out. It would be nice if there were
> a way to get the kernel to abort all outstanding i/o on kill -9, but I'm
> sure if it were easy it would have happened. Timeouts in the application
> are useful, but in some cases I believe the process dies because it
> detects a long i/o time but has nothing to do but terminate, which
> creates the zombie.

I believe it could be any driver code that uses uninterruptible sleeps
rather than interruptible sleeps. If a process is doing a read or
write to one of these devices and it stays stuck in kernel code in
TASK_UNINTERRUPTIBLE and never gets its expected wakeup, then the
signal will never be delivered and the process is stuck indefinitely.
The buggy driver code needs to be fixed (either to use interruptible
sleeps and handle the signals, or to implement some sort of timeout).
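
As a user-space counterpart to the timeout idea, here is a minimal Python sketch of an application-level timeout around a blocking read. It assumes the descriptor supports readiness polling (pipes, sockets, many character devices); read_with_timeout is a name invented for this example. It cannot rescue a task that is already stuck in TASK_UNINTERRUPTIBLE inside a driver; it only keeps the application from waiting forever in the ordinary, interruptible case.

```python
import os
import select

def read_with_timeout(fd, nbytes, seconds):
    """Return up to nbytes read from fd, or None if no data becomes
    available within `seconds`."""
    ready, _, _ = select.select([fd], [], [], seconds)
    if not ready:
        return None  # timed out: the caller decides whether to retry or give up
    return os.read(fd, nbytes)
```

With a pipe, read_with_timeout(r, 16, 0.1) returns None while the pipe is empty and returns the data once the writer has sent something.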

~mc

2004-11-04 01:39:11

by Douglas McNaught

Subject: Re: is killing zombies possible w/o a reboot?

Russell Miller <[email protected]> writes:

> This brings up another question I've had since reading the documentation on
> later pentium-class chips:
>
> why are only rings 0 and 3 used in linux?

Because the "traditional" Unix privilege model only has two levels,
and Linux runs on many architectures, most of which have only two
privilege levels (the 68000 called them "user" and "supervisor").
Special-casing x86 is possible but probably wouldn't be worth it.

-Doug

2004-11-04 01:41:58

by Russell Miller

Subject: Re: is killing zombies possible w/o a reboot?

On Wednesday 03 November 2004 19:38, Doug McNaught wrote:
> Russell Miller <[email protected]> writes:
> > This brings up another question I've had since reading the documentation
> > on later pentium-class chips:
> >
> > why are only rings 0 and 3 used in linux?
>
> Because the "traditional" Unix privilege model only has two levels,
> and Linux runs on many architectures, most of which have only two
> privilege levels (the 68000 called them "user" and "supervisor").
> Special-casing x86 is possible but probably wouldn't be worth it.
>
Wouldn't it help with device driver problems? Couldn't ring 1 be used to make
sure an errant driver doesn't drop the kernel, at least on x86 machines?

I remember the 68000 architecture. Quite nice (but I was 10 when I studied
it, so..).

--Russell

> -Doug

2004-11-04 02:15:06

by Mitchell Blank Jr

Subject: Re: is killing zombies possible w/o a reboot?

Russell Miller wrote:
> Couldn't ring 1 be used to make
> sure an errant driver doesn't drop the kernel, at least on x86 machines?

Not really -- drivers could still do things like mis-program their associated
hardware making it do DMA writes all over kernel memory (just as one example)

Basically it'd add a lot of complexity (and inefficiency) without adding
much real safety.

-Mitch

2004-11-04 02:14:26

by Douglas McNaught

Subject: Re: is killing zombies possible w/o a reboot?

Russell Miller <[email protected]> writes:

> Wouldn't it help with device driver problems? Couldn't ring 1 be
> used to make sure an errant driver doesn't drop the kernel, at least
> on x86 machines?

As I understand it:

1) Ring transitions aren't free.
2) The API between drivers and kernel is always in flux; drivers
expect to be able to access internal kernel data structures.
Making drivers run in ring 1 on even one of the N architectures
would be a major refactoring and would constrain API changes.
Freezing the internal API is something the developers don't want to
do.
3) There are probably plenty of ways for a buggy driver to crash the
kernel even if it's running in ring 1 (turn off interrupts and
leave them off, etc).

So the upshot is that it's probably not worth the work and portability
hassles.

-Doug

2004-11-04 06:40:44

by Denis Vlasenko

Subject: Re: is killing zombies possible w/o a reboot?

On Thursday 04 November 2004 01:33, Russell Miller wrote:
> On Wednesday 03 November 2004 17:03, Doug McNaught wrote:
>
> > It was already mentioned in this thread that the bookkeeping required
> > to clean up properly from such an abort would add a lot of overhead
> > and slow down the normal, non-buggy case.
> >
> I am going to continue pursuing this at the risk of making a bigger fool of
> myself than I already am, but I want to make sure that I understand the
> issues - and I did read the message you are referring to.
>
> I think what you are saying is that there is kind of a race condition here.
> When something is on the wait queue, it has to be followed through to
> completion. An interrupt could be received at any time, and if it's taken
> off of the wait queue prematurely, it'll crash the kernel, because the
> interrupt has no way of telling that.

The problem is in locking. You must not kill a process while it is
in uninterruptible state, because it is uninterruptible
for a reason - it has taken a semaphore, or called get_cpu(), etc.
You do want it to do put_cpu(), right?

Processes must never get stuck in D; if they do, it's a kernel bug.

Find out how the process ended up in D state forever,
and fix it - that's what I try to do
in these cases.
--
vda

2004-11-04 07:25:31

by Jan Knutar

Subject: Re: is killing zombies possible w/o a reboot?

On Wednesday 03 November 2004 23:08, Gene Heskett wrote:

> Its a dead horse Tom, lets bury it. I've rebooted to 4 new kernels
> since that time as I march toward getting caught up with whatever
> bk(nn) is out today. Other than that, which took place on bk7's
> watch, its all working rather well.

Since nobody else seems to have said it, it would be a good idea
to enable sysrq and do a sysrq-T the next time (if) this happens,
so that there would be at least some information to go on.
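
For a box that is reachable over ssh but has no usable console keyboard, the same steps can be scripted. The following is a minimal Python sketch, assuming a kernel built with CONFIG_MAGIC_SYSRQ that exposes /proc/sys/kernel/sysrq and /proc/sysrq-trigger, and a root shell. The function names are invented for this example.

```python
def sysrq_enabled(path="/proc/sys/kernel/sysrq"):
    """Return True if the sysrq sysctl currently holds a non-zero value."""
    with open(path) as f:
        return int(f.read().strip()) != 0

def enable_sysrq(path="/proc/sys/kernel/sysrq"):
    """Same effect as: echo 1 > /proc/sys/kernel/sysrq (needs root)."""
    with open(path, "w") as f:
        f.write("1\n")

def dump_tasks(trigger="/proc/sysrq-trigger"):
    """Request the task dump that Alt-SysRq-T produces (needs root).

    The output goes to the kernel log, not to the caller.
    """
    with open(trigger, "w") as f:
        f.write("t")
```

The paths are parameters only so the helpers can be pointed at scratch files for a dry run; on a real system the defaults are the ones to use.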

2004-11-04 09:58:41

by Helge Hafting

Subject: Re: is killing zombies possible w/o a reboot?

Russell Miller wrote:

>On Wednesday 03 November 2004 16:15, Jim Nelson wrote:
>
>
>
>>I did this to myself a number of times when I was first learning Samba -
>>even an ls would become unkillable. You couldn't rmmod smb, since it was
>>in use, and you couldn't kill the process, since it was waiting on a
>>syscall. Ergh.
>>
>>
>>
>
>I'm not going to pretend to be a kernel expert, or really anything other than
>a newbie when it comes to kernel internals, so please take this with the
>merits it deserves - many, or none, depending.
>
>Anyway, is there a way to simply signal a syscall that it is to be interrupted
>and forcibly cause the syscall to end?
>
There is a way. Processes go into D state all the time when waiting
for disk io or similar. Then the io happens a few ms later, and the
fs or device driver tells the kernel to wake up the process so it gets
a chance at the next scheduling opportunity. So the mechanism to
unstick a process exists, and is used by every device driver that
sleeps. Which is most of them.

Breakage happens when something never comes out of D-state.
One could write a trivial syscall (or addition to "kill") that "wakes"
processes waiting for io. It isn't hard to do at all - just copy the
waking code from any device driver. This would allow you to kill and
fully remove any process that hangs around in D-state. It might
also release other stuck resources as the syscall
continues, returns to userspace, and allows the process to die.

Unfortunately, this isn't enough. In some cases the syscall
expects the io device interrupt handler to have done something
vital - but this hasn't happened when we forcibly wake a process.
We can hope for an io error, but might get a crash instead. This
can be fixed with a lot of work - basically check at every wakeup
whether the process was woken by this new killing mechanism and
act accordingly. It shouldn't be hard, but it is _lots_ of work
inspecting every sleeping point, at least in every device driver.

Another problem exists if the long-waiting io wasn't lost - just
extremely slow. If the io actually comes through after the process is
gone and the memory is used for something else - bang! Dealing
properly with this case is harder - a new generic mechanism for
cancelling outstanding io requests is needed for this.
It might even be impossible in some cases. If a memory address is handed
over to a bus-mastering device such as a scsi adapter, then the memory
must be pinned down until the operation completes. It cannot be released.
The rest of the process can go, but the hw might not support any way
of cancelling the request. A few may have a way, many won't. Some devices
can be reset - but at a considerable cost. A disk controller might be
unavailable for seconds during such a reset - instant DoS attack if a
user keeps starting lots of disk-intensive processes and kills them off
while in a D-state that normally lasts far shorter than a reset. PCI
devices can be turned off, but we might really want to use them
again . . .

Fortunately, most cases of long-running D-state are just driver bugs
and can be fixed as such. nfs has a forced umount option. If samba can
hang, then it _can_ be fixed in similar ways. (smbfs is software only -
no quirky hw to deal with.) Hw drivers that put processes into
everlasting D-state usually do so because of a bug. (Lost request or
interrupt because of internal errors.) Fix that, and the problem never
happens. So the hard problem of killing stuff stuck in D-state
doesn't need a solution - fix the real bug instead. Having a way to
kill such processes will only mean that hard-to-trigger bugs won't get
fixed because there is a workaround. This is bad for stability too, as
broken hw drivers can hang the kernel even if a better process killer
comes into existence.

> Kicking the program execution out of
>kernel space would be sufficient to "unstick" the process - and coupling that
>with an automatic KILL signal may not be a bad idea.
>
>I'm pretty sure that someone will think of a way why this wouldn't work with
>very little effort. Please enlighten me?
>
>
It is doable - but not with "very little effort". I have outlined above
the trouble you get if you trivially wake up the sleeping process.
Another trivial alternative is to remove the process while it is
in-kernel. The downside is that it might be holding a lock or semaphore
that won't ever be released this way. And no, locks aren't necessarily
accounted for anywhere. (They are implicitly accounted for by the fact
that a process exists whose future execution path leads to the release
of said lock.) Explicit accounting that allows lock-breaking is deemed
too slow, and what to do about the data structures the lock/semaphore
was protecting?

The stuck process is a sign of another bug - better fix that one.

Helge Hafting




2004-11-04 10:08:00

by Matthias Andree

Subject: Re: is killing zombies possible w/o a reboot?

On Wed, 03 Nov 2004, Gene Heskett wrote:

> >Yes it does - the problem is that not all resources are managed
> >by processes. Some allocations are managed by drivers, so a driver
> >bug can get the device into a unuseable state _and_ tie up the
> >process(es) that were using the driver at the moment.
>
> This from my viewpoint, is wrong. The kernel, and only the kernel
> should be ultimately responsible for handing out resources, and
> reclaiming at its convienience.

Linux's driver model is the way it is. If you want the kernel to clean
up after a driver has puked, you need something like a microkernel I
believe, where only a minimal core kernel is a real kernel and where all
the drivers are actually in user-space, but that's no longer Linux then.

I won't go into the downsides and upsides of this, as I have no
experience with microkernels (and have never used OS9 or GNU Hurd
either). I know there have been attempts to port Linux to a microkernel
but I don't know what's come out of it.

--
Matthias Andree

2004-11-04 10:21:20

by DervishD

Subject: Re: is killing zombies possible w/o a reboot?

Hi Bill :)

* Bill Davidsen <[email protected]> dixit:
> > Or write a little program that just 'wait()'s for the specified
> >PID's. That is perfectly portable IMHO. But I must admit that the
> >preferred way should be killing the parent. 'init' will reap the
> >children after that.
> You can't wait() for the process, you have to use waitfor(), and the
> last time I tried that it didn't work, although I don't remember the
> symptom beyond that.

You can't wait for others' children. OTOH, if we talk about your
children, you can do wait() or waitpid() (I assume that you referred
to waitpid(), since there isn't waitfor() AFAIK). The main difference
is that wait() always suspends the process until information from
some child is available, while waitpid() lets you pick the child and,
with WNOHANG, return immediately.

If you are talking about others' children, then your call to
waitpid() (or wait()) failed with ECHILD: not your child.
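
The ECHILD behaviour is easy to reproduce. Here is a minimal Python sketch (try_reap is a helper name invented for this example) that asks waitpid() about an arbitrary PID without blocking and reports what the kernel answered:

```python
import os

def try_reap(pid):
    """Try to reap `pid` without blocking.

    Returns "reaped" if it was our own zombie child, "running" if it is
    our child but has not exited yet, and "not-our-child" if waitpid()
    fails with ECHILD.
    """
    try:
        reaped, _status = os.waitpid(pid, os.WNOHANG)
    except ChildProcessError:  # Python's exception for errno ECHILD
        return "not-our-child"
    return "reaped" if reaped == pid else "running"
```

Calling try_reap(1) from an ordinary process gives "not-our-child", which is why a stand-alone "zombie reaper" program cannot clean up zombies that belong to some other parent.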

Raúl Núñez de Arenas Coronado

--
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/

2004-11-04 10:24:32

by DervishD

Subject: Re: is killing zombies possible w/o a reboot?

Hi Bill :)

* Bill Davidsen <[email protected]> dixit:
> > I think that the parent (which is whatever process did the fork
> >when you clicked your mouse) is still alive and forgetting to do the
> >'wait()' for its children.
> It would be good to know what the PPID is, from ps or similar. Things
> from X are a pain, the parent is often something you don't want to kill.
> Sometimes you can reparent from command line, "bash -c foo&" or similar,
> so the parent can be killed without logging out.

Just use ps to reveal the family tree. It's not that hard ;)

> I would swear that the parent *is* init in some cases, which is puzzling
> since they should be reaped.

But that's OK :))) When a parent dies without waiting for its
children, the zombies are reparented to init. That's correct. Then
init will wait for them. The problem is that sometimes the signals
don't arrive or the like. Then the zombies lie around for a bit,
until a timer in 'init' reaps them. That's correct too: init can only
wait for children when it receives SIGCHLD or periodically, using a
timer. I've written an init program and that's the way I do it, just
in case some signal gets lost.

If init is the parent, all works ok, just wait a bit and all
those zombies will really die ;)
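
To make the "SIGCHLD plus a timer" scheme concrete, here is a minimal Python sketch of such a reaper. It is not taken from any real init; reap_all and install_reaper are names invented for this example, and the 5 second interval is an arbitrary assumption.

```python
import os
import signal

def reap_all():
    """Collect every child that has already exited; never blocks.

    Returns the list of reaped PIDs. One SIGCHLD can stand for several
    exited children, hence the loop.
    """
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # ECHILD: no children left at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped

def install_reaper(interval=5.0):
    """Reap on SIGCHLD and, as a safety net, every `interval` seconds."""
    def handler(signum, frame):
        reap_all()
    signal.signal(signal.SIGCHLD, handler)
    signal.signal(signal.SIGALRM, handler)
    # Periodic SIGALRM covers the case where a SIGCHLD was missed.
    signal.setitimer(signal.ITIMER_REAL, interval, interval)
```

A parent that calls install_reaper() once at startup never accumulates zombies, even if it otherwise forgets about its children.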

Raúl Núñez de Arenas Coronado

--
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/

2004-11-04 12:02:21

by Gene Heskett

Subject: Re: is killing zombies possible w/o a reboot?

On Thursday 04 November 2004 02:19, Jan Knutar wrote:
>On Wednesday 03 November 2004 23:08, Gene Heskett wrote:
>> Its a dead horse Tom, lets bury it. I've rebooted to 4 new
>> kernels since that time as I march toward getting caught up with
>> whatever bk(nn) is out today. Other than that, which took place
>> on bk7's watch, its all working rather well.
>
>Since nobody else seems to have said it, it would be a good idea
>to enable sysrq and do a sysrq-T the next time (if) this happens,
>so that there would be atleast some information to go on.

I've had that turned on since forever, Jan, but usually, when it's hung
someplace, it's well and truly hung, and it's hardware reset button time.

--
Cheers, Gene

2004-11-04 12:14:51

by Jan Knutar

Subject: Re: is killing zombies possible w/o a reboot?

On Thursday 04 November 2004 13:57, Gene Heskett wrote:

> I'e had that turned on since forever Jan, but usually, when its hung
> someplace, its well and truely hung, and hardware reset button time.

Are you saying that these zombies (or tasks stuck in state D) also make
sysrq-T hang, and not list all tasks?

2004-11-04 12:23:08

by Gene Heskett

Subject: Re: is killing zombies possible w/o a reboot?

On Thursday 04 November 2004 07:12, Jan Knutar wrote:
>On Thursday 04 November 2004 13:57, Gene Heskett wrote:
>> I'e had that turned on since forever Jan, but usually, when its
>> hung someplace, its well and truely hung, and hardware reset
>> button time.
>
>Are you saying that these zombies (or tasks stuck in state D) also
> make sysrq-T hang, and not list all tasks?

The machine is hung. No ssh, no ping response, the only button that
works is the hardware reset on the front of the tower.

--
Cheers, Gene

2004-11-04 12:33:36

by Jan Knutar

Subject: Re: is killing zombies possible w/o a reboot?

On Thursday 04 November 2004 14:18, Gene Heskett wrote:

> The machine is hung. No ssh, no ping response, the only button that
> works is the hardware reset on the front of the tower.

I must've missed where the thread went from zombies into totally hung
machine. My apologies for the noise.

2004-11-04 12:41:57

by Gene Heskett

Subject: Re: is killing zombies possible w/o a reboot?

On Thursday 04 November 2004 07:12, Jan Knutar wrote:
>On Thursday 04 November 2004 13:57, Gene Heskett wrote:
>> I'e had that turned on since forever Jan, but usually, when its
>> hung someplace, its well and truely hung, and hardware reset
>> button time.
>
>Are you saying that these zombies (or tasks stuck in state D) also
> make sysrq-T hang, and not list all tasks?

I thought I'd test it right now while the system is running normally,
but I got only a beep from the console, so I went to
Documentation/sysrq.txt to make sure I was doing it right, and it is
_not_ working right now. But it is compiled in according to a make
xconfig, or a grep of the .config.

[root@coyote linux-2.6.10-rc1-bk13]# grep SYSRQ .config
CONFIG_MAGIC_SYSRQ=y

I get a couple of beeps from the console, but that's the limit of the
response, and a tail -f on the log shows nothing. I also logged into
VC2, and tried it there, but that attempt didn't even get me a beep,
several times.

The keyboard is a cheap ($24) M$ with a few extra buttons that don't
do anything along the top. And getting a bit creaky in its old age,
a lot like me, but I'm about 68 years older than the keyboard :)

--
Cheers, Gene

2004-11-04 13:01:35

by Ian Campbell

Subject: Re: is killing zombies possible w/o a reboot?

On Thu, 2004-11-04 at 07:39 -0500, Gene Heskett wrote:
> On Thursday 04 November 2004 07:12, Jan Knutar wrote:
> >On Thursday 04 November 2004 13:57, Gene Heskett wrote:
> >> I'e had that turned on since forever Jan, but usually, when its
> >> hung someplace, its well and truely hung, and hardware reset
> >> button time.
> >
> >Are you saying that these zombies (or tasks stuck in state D) also
> > make sysrq-T hang, and not list all tasks?
>
> I thought I'd test it right now while the system is runnng normally,
> but I got only a beep from the console, so I went to
> Documentation/sysrq.txt to make sure I was doing it right, and it is
> _not_ working right now. But it is compiled in according to a make
> xconfig, or a grep of the .config.

It can also be enabled/disabled at runtime, Documentation/sysrq.txt says
that the default now is on (but that it used to default to off). Perhaps
it is getting turned off somewhere in your boot scripts etc.

You can check with

$ cat /proc/sys/kernel/sysrq
1

> The keyboard is a cheap ($24) M$ with a few extra buttons that don't
> do anything along the top. And getting a bit creaky in its old age,
> a lot like me, but I'm about 68 years older than the keyboard :)

Documentation/sysrq.txt also says:

* How do I use the magic SysRq key?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On x86 - You press the key combo 'ALT-SysRq-<command key>'. Note - Some
keyboards may not have a key labeled 'SysRq'. The 'SysRq' key is
also known as the 'Print Screen' key. Also some keyboards cannot
handle so many keys being pressed at the same time, so you might
have better luck with "press Alt", "press SysRq", "release Alt",
"press <command key>", release everything.

Perhaps your keyboard is one of those that can't cope with all those
keys?

Ian.

--
Ian Campbell, Senior Design Engineer
Web: http://www.arcom.com
Arcom, Clifton Road, Direct: +44 (0)1223 403 465
Cambridge CB1 7EA, United Kingdom Phone: +44 (0)1223 411 200

2004-11-04 13:11:45

by Douglas McNaught

Subject: Re: is killing zombies possible w/o a reboot?

Gene Heskett <[email protected]> writes:

> [root@coyote linux-2.6.10-rc1-bk13]# grep SYSRQ .config
> CONFIG_MAGIC_SYSRQ=y

Did you also enable it in /proc?

-Doug

2004-11-04 13:56:23

by Gene Heskett

Subject: Re: is killing zombies possible w/o a reboot?

On Thursday 04 November 2004 07:29, Jan Knutar wrote:
>On Thursday 04 November 2004 14:18, Gene Heskett wrote:
>> The machine is hung. No ssh, no ping response, the only button
>> that works is the hardware reset on the front of the tower.
>
>I must've missed where the thread went from zombies into totally
> hung machine. My apologies for the noise.

It went from an unkillable process (gnomeradio) that was blocking
other programs like tvtime with its locks on /dev/video0, to
completely hung at "stopping alsasound" when I tried to reboot. That
required the reset button to get going again.

--
Cheers, Gene

2004-11-04 14:07:53

by Gene Heskett

Subject: Re: is killing zombies possible w/o a reboot?

On Thursday 04 November 2004 08:01, Ian Campbell wrote:
>On Thu, 2004-11-04 at 07:39 -0500, Gene Heskett wrote:
>> On Thursday 04 November 2004 07:12, Jan Knutar wrote:
>> >On Thursday 04 November 2004 13:57, Gene Heskett wrote:
>> >> I'e had that turned on since forever Jan, but usually, when its
>> >> hung someplace, its well and truely hung, and hardware reset
>> >> button time.
>> >
>> >Are you saying that these zombies (or tasks stuck in state D)
>> > also make sysrq-T hang, and not list all tasks?
>>
>> I thought I'd test it right now while the system is runnng
>> normally, but I got only a beep from the console, so I went to
>> Documentation/sysrq.txt to make sure I was doing it right, and it
>> is _not_ working right now. But it is compiled in according to a
>> make xconfig, or a grep of the .config.
>
>It can also be enabled/disabled at runtime, Documentation/sysrq.txt
> says that the default now is on (but that it used to default to
> off). Perhaps it is getting turned off somewhere in your boot
> scripts etc.
>
>You can check with
>
>$ cat /proc/sys/kernel/sysrq
>1
>
>> The keyboard is a cheap ($24) M$ with a few extra buttons that
>> don't do anything along the top. And getting a bit creaky in its
>> old age, a lot like me, but I'm about 68 years older than the
>> keyboard :)
>
>Documentation/sysrq.txt also says:
>
>* How do I use the magic SysRq key?
>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>On x86 - You press the key combo 'ALT-SysRq-<command key>'. Note -
> Some keyboards may not have a key labeled 'SysRq'. The 'SysRq' key
> is also known as the 'Print Screen' key. Also some keyboards cannot
> handle so many keys being pressed at the same time, so you might
> have better luck with "press Alt", "press SysRq", "release Alt",
> "press <command key>", release everything.
>
>Perhaps your keyboard is one of those that can't cope with all those
>keys?
>
>Ian.

Possibly, but OTOH,
[root@coyote root]# cat /proc/sys/kernel/sysrq
0

And no, I'm not turning it off anyplace in the boot procedure. An
'echo 1 >/proc/sys/kernel/sysrq', and repeating the keypresses now
gets a boatload of stuff in the logs, but nothing on the console.

The logs look something like this:

Nov 4 08:59:29 coyote kernel: kdeinit S C0453F08 0 18964 3327 18965 18963 (NOTLB)
Nov 4 08:59:29 coyote kernel: c657ae8c 00200082 c6820120 c0453f08 0000202c 00000000 b4d18366 0000202c
Nov 4 08:59:29 coyote kernel: 00002ecd b4d1e78f 0000202c c6820600 c682075c 0217d045 c657aea0 fffffff5
Nov 4 08:59:29 coyote kernel: c657aedc c033bca3 c657aea0 0217d045 c657aec4 dfa88ea0 ee3aeea0 0217d045
Nov 4 08:59:29 coyote kernel: Call Trace:
Nov 4 08:59:29 coyote kernel: [<c033bca3>] schedule_timeout+0x63/0xc0
Nov 4 08:59:29 coyote kernel: [<c0120150>] process_timeout+0x0/0x10
Nov 4 08:59:29 coyote kernel: [<c012c12f>] futex_wait+0x12f/0x1a0
Nov 4 08:59:29 coyote kernel: [<c0114160>] default_wake_function+0x0/0x20
Nov 4 08:59:29 coyote kernel: [<c0114160>] default_wake_function+0x0/0x20
Nov 4 08:59:29 coyote kernel: [<c012c418>] do_futex+0x48/0xa0
Nov 4 08:59:29 coyote kernel: [<c012c55e>] sys_futex+0xee/0x100
Nov 4 08:59:29 coyote kernel: [<c01040a9>] sysenter_past_esp+0x52/0x71
Nov 4 08:59:29 coyote kernel: kdeinit S C0453A60 0 18965 3327 18966 18964 (NOTLB)
Nov 4 08:59:29 coyote kernel: dfa88e8c 00200082 c6820120 c0453a60 dfa88eac 00000000 ed99e990 00000000
Nov 4 08:59:29 coyote kernel: 00006be7 b816258b 0000202c c6820120 c682027c 0217d07c dfa88ea0 fffffff5
Nov 4 08:59:29 coyote kernel: dfa88edc c033bca3 dfa88ea0 0217d07c dfa88ec4 c0459928 c657aea0 0217d07c
Nov 4 08:59:29 coyote kernel: Call Trace:
Nov 4 08:59:29 coyote kernel: [<c033bca3>] schedule_timeout+0x63/0xc0
Nov 4 08:59:29 coyote kernel: [<c0120150>] process_timeout+0x0/0x10
Nov 4 08:59:29 coyote kernel: [<c012c12f>] futex_wait+0x12f/0x1a0
Nov 4 08:59:29 coyote kernel: [<c0114160>] default_wake_function+0x0/0x20
Nov 4 08:59:29 coyote kernel: [<c0114160>] default_wake_function+0x0/0x20
Nov 4 08:59:29 coyote kernel: [<c012c418>] do_futex+0x48/0xa0
Nov 4 08:59:29 coyote kernel: [<c012c55e>] sys_futex+0xee/0x100
Nov 4 08:59:29 coyote kernel: [<c01040a9>] sysenter_past_esp+0x52/0x71
Nov 4 08:59:29 coyote kernel: kdeinit S C0453A60 0 18966 3327 18965 (NOTLB)
Nov 4 08:59:29 coyote kernel: ee3aee8c 00200082 e770fb00 c0453a60 ee3aeeac 00000000 ed99e990 00000000
Nov 4 08:59:29 coyote kernel: 00001e29 b4b250fe 0000202c e770fb00 e770fc5c 0217d043 ee3aeea0 fffffff5
Nov 4 08:59:29 coyote kernel: ee3aeedc c033bca3 ee3aeea0 0217d043 666c6573 c657aea0 c039be78 0217d043
Nov 4 08:59:29 coyote kernel: Call Trace:
Nov 4 08:59:29 coyote kernel: [<c033bca3>] schedule_timeout+0x63/0xc0
Nov 4 08:59:29 coyote kernel: [<c0120150>] process_timeout+0x0/0x10
Nov 4 08:59:29 coyote kernel: [<c012c12f>] futex_wait+0x12f/0x1a0
Nov 4 08:59:29 coyote kernel: [<c0114160>] default_wake_function+0x0/0x20
Nov 4 08:59:29 coyote kernel: [<c0114160>] default_wake_function+0x0/0x20
Nov 4 08:59:29 coyote kernel: [<c012c418>] do_futex+0x48/0xa0
Nov 4 08:59:29 coyote kernel: [<c012c55e>] sys_futex+0xee/0x100
Nov 4 08:59:29 coyote kernel: [<c01040a9>] sysenter_past_esp+0x52/0x71

There is a lot more of that above that snip, several pages.
And of course the system seems to be running fine ATM. :-)
But I'm learning, and that echo will go into my rc.local as soon as
I'm done here.

--
Cheers, Gene

2004-11-04 14:12:52

by Gene Heskett

[permalink] [raw]
Subject: Re: is killing zombies possible w/o a reboot?

On Thursday 04 November 2004 08:10, Doug McNaught wrote:
>Gene Heskett <[email protected]> writes:
>> [root@coyote linux-2.6.10-rc1-bk13]# grep SYSRQ .config
>> CONFIG_MAGIC_SYSRQ=y
>
>Did you also enable it in /proc?
>
>-Doug

I just now discovered it defaults to a 0, so I put an
echo 1 >/proc/sys/kernel/sysrq
in rc.local just now.

Thanks for the heads up.

--
Cheers, Gene

2004-11-04 14:25:48

by Ian Campbell

[permalink] [raw]
Subject: Re: is killing zombies possible w/o a reboot?

On Thu, 2004-11-04 at 09:07 -0500, Gene Heskett wrote:
> [root@coyote root]# cat /proc/sys/kernel/sysrq
> 0

Aha :-)

> And no, I'm not turning it off anyplace in the boot procedure.

Something must be -- you can see in drivers/char/sysrq.c that
sysrq_enabled is set to 1 by default and according to bkbits.net it has
been that way since at least 2.4.0.

Does the following not come up with any culprits?
# grep -r sysrq /etc

Ian.

--
Ian Campbell, Senior Design Engineer
Web: http://www.arcom.com
Arcom, Clifton Road, Direct: +44 (0)1223 403 465
Cambridge CB1 7EA, United Kingdom Phone: +44 (0)1223 411 200



2004-11-04 14:26:35

by DervishD

[permalink] [raw]
Subject: Re: is killing zombies possible w/o a reboot?

Hi Gene :)

* Gene Heskett <[email protected]> dixit:
> Possibly, but OTOH,
> [root@coyote root]# cat /proc/sys/kernel/sysrq
> 0
>
> And no, I'm not turning it off anyplace in the boot procedure. An
> 'echo 1 >/proc/sys/kernel/sysrq', and repeating the keypresses now
> gets a boatload of stuff in the logs, but nothing on the console.

Well, the stuff goes to the logs and not the console because of
the console log level. You can change that using proc, too. Look in
/proc/sys/kernel/printk (well, at least under 2.4.x). You'll see
four numbers. The first one is the console loglevel. Any kernel
message with a priority higher than this (that is, with a lower level
number) is printed on the console. The rest are not.

The second number is the default message level. Any message
without a priority will get this priority.

The third number is the minimum value (that is, the highest
priority) you can assign to the first number (the console loglevel).

The fourth number is the default value for the first number.

The interesting number for you is the first one. Set it to a
correct value for you (see syslog(2) to see what the numbers mean).
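
To make the four fields concrete, here is a minimal Python sketch, not taken from the thread, that parses the contents of /proc/sys/kernel/printk and applies the console rule described above. The names PrintkLevels, parse_printk and reaches_console are made up for illustration, and the sample values "6 4 1 7" are an assumed example, not output from Gene's machine.

```python
from typing import NamedTuple


class PrintkLevels(NamedTuple):
    console_loglevel: int          # messages with a lower number than this reach the console
    default_message_loglevel: int  # level given to messages that carry no explicit one
    minimum_console_loglevel: int  # lowest value console_loglevel may be set to
    default_console_loglevel: int  # value console_loglevel starts out with


def parse_printk(text: str) -> PrintkLevels:
    """Parse the four whitespace-separated integers of /proc/sys/kernel/printk."""
    fields = text.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}")
    return PrintkLevels(*(int(f) for f in fields))


def reaches_console(message_level: int, levels: PrintkLevels) -> bool:
    """A message is shown on the console only if its level number is lower."""
    return message_level < levels.console_loglevel


if __name__ == "__main__":
    # Assumed sample contents; on a real system read /proc/sys/kernel/printk.
    print(parse_printk("6\t4\t1\t7"))
```

On a live system the text would come from reading /proc/sys/kernel/printk, and writing a new first number to that file (as root) changes what reaches the console.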

Raúl Núñez de Arenas Coronado

--
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/

2004-11-04 14:29:43

by Paul Slootman

[permalink] [raw]
Subject: Re: is killing zombies possible w/o a reboot?

DervishD <[email protected]> wrote:
>
> If init is the parent, all works ok, just wait a bit and all
>those zombies will really die ;)

I recently had a system with serial console where for some reason the
serial port was stopped. This meant that init blocked while writing some
message (e.g. "respawning too rapidly"), and that meant it stopped
reaping those zombie processes. The list of these zombie processes with
PPID == 1 was amazing. The only thing that helped was rebooting after
replacing the serial console cable.

(Kernel 2.4.25, sysvinit 2.85 in case you're wondering.)


Paul Slootman

2004-11-04 14:34:30

by tlaurent

[permalink] [raw]
Subject: Re: is killing zombies possible w/o a reboot?

Selon Gene Heskett <[email protected]>:

> On Thursday 04 November 2004 08:10, Doug McNaught wrote:
> >Gene Heskett <[email protected]> writes:
> >> [root@coyote linux-2.6.10-rc1-bk13]# grep SYSRQ .config
> >> CONFIG_MAGIC_SYSRQ=y
> >
> >Did you also enable it in /proc?
> >
> >-Doug
>
> I just now discovered it defaults to a 0, so I put an
> echo 1 >/proc/sys/kernel/sysrq
> in rc.local just now.

You might also want to have a look at /etc/sysctl.conf. Some distros put a
kernel.sysrq=0 in it...
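
A quick way to check this without eyeballing the file: the following Python sketch (an illustration only, with a made-up helper name effective_setting) scans sysctl.conf-style text and reports the last uncommented value assigned to kernel.sysrq. It assumes the usual file conventions: one "key = value" per line, lines starting with '#' or ';' are comments, and later assignments win because the file is applied top to bottom.

```python
from typing import Optional


def effective_setting(conf_text: str, key: str = "kernel.sysrq") -> Optional[str]:
    """Return the last uncommented value assigned to `key`, or None if absent."""
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines ('#' or ';').
        if not line or line[0] in "#;":
            continue
        name, sep, val = line.partition("=")
        if sep and name.strip() == key:
            value = val.strip()  # a later assignment overrides an earlier one
    return value


if __name__ == "__main__":
    sample = "# Disables the magic-sysrq key\nkernel.sysrq = 0\n"
    print(effective_setting(sample))
```

Feeding it the contents of /etc/sysctl.conf would show whether the boot scripts are about to switch SysRq off again after rc.local or the kernel default turned it on.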

Cheers,
Thibaut

>
> Thanks for the heads up.


2004-11-04 14:57:06

by Gene Heskett

[permalink] [raw]
Subject: Re: is killing zombies possible w/o a reboot?

On Thursday 04 November 2004 09:23, Paul Slootman wrote:
>DervishD <[email protected]> wrote:
>> If init is the parent, all works ok, just wait a bit and all
>>those zombies will really die ;)
>
>I recently had a system with serial console where for some reason
> the serial port was stopped. This meant that init blocked while
> writing some message (e.g. "respawning too rapidly"), and that
> meant it stopped reaping those zombie processes. The list of these
> zombie processes with PPID == 1 was amazing. The only thing that
> helped was rebooting after replacing the serial console cable.
>
>(Kernel 2.4.25, sysvinit 2.85 in case you're wondering.)

Both serial ports are already in use here Paul, one for heyu and x10
stuff related to my home automation (mostly the outside lights), and
the other to my Belkin UPS, whose USB interface has never worked, so
I'm stuck using serial for the BullDog interface to gkrellm. I'd
like to find a cheap PCI RocketPort as I have another vintage box in
the basement that could use this machine as a network gateway then.
Right now it's on a PL2303 USB<->serial converter but something's wrong
with the handshaking on that end.

>Paul Slootman

--
Cheers, Gene

2004-11-04 15:10:31

by Gene Heskett

[permalink] [raw]
Subject: Re: is killing zombies possible w/o a reboot?

On Thursday 04 November 2004 09:24, Ian Campbell wrote:
>grep -r sysrq /etc

Gets me a bunch. The relevant ones would be:
/etc/rc.d/rc3.d/K20iscsi: if [ -e /proc/sys/kernel/sysrq ] ; then
/etc/rc.d/rc3.d/K20iscsi: echo "1" > /proc/sys/kernel/sysrq

and
/etc/rc.d/rc.local:# Turn on the magic sysrq keys
/etc/rc.d/rc.local:echo 1 >/proc/sys/kernel/sysrq

But, what about this:
/etc/sysctl.conf:# Disables the magic-sysrq key
/etc/sysctl.conf:kernel.sysrq = 0
which I just commented out...

And this:
/etc/linuxconf/archive/Office/etc/sysctl.conf,v:kernel.sysrq = 0
But everything there is dated early 2001. I think it's filesystem
cruft nowadays, subject to being a space patrol target eventually.

--
Cheers, Gene

2004-11-04 15:13:18

by Gene Heskett

[permalink] [raw]
Subject: Re: is killing zombies possible w/o a reboot?

On Thursday 04 November 2004 09:26, DervishD wrote:
> Hi Gene :)
>
> * Gene Heskett <[email protected]> dixit:
>> Possibly, but OTOH,
>> [root@coyote root]# cat /proc/sys/kernel/sysrq
>> 0
>>
>> And no, I'm not turning it off anyplace in the boot procedure.
>> An 'echo 1 >/proc/sys/kernel/sysrq', and repeating the keypresses
>> now gets a boatload of stuff in the logs, but nothing on the
>> console.
>
> Well, the stuff goes to the logs and not the console because of
>the console log level. You can change that using proc, too. Look in
>/proc/sys/kernel/printk (well, at least under 2.4.x). You'll see
>four numbers. The first one is the console loglevel. Any message
>directed to syslog with a priority higher than this number will be
>printed in the console. Otherwise they won't.
>
> The second number is the default message level. Any message
>without a priority will get this priority.
>
> The third number is the highest value you can assign to the
> first number (the console loglevel).
>
> The fourth number is the default value for the first number.
>
> The interesting number for you is the first one. Set it to a
>correct value for you (see syslog(2) to see what the numbers mean).
>
> Raúl Núñez de Arenas Coronado

I have it going to the logs as the preferred method as that's permanent
whereas the console output is 100% volatile. That way I can look at
the logs when the machine has been made functional again.

--
Cheers, Gene

2004-11-04 15:14:22

by Gene Heskett

[permalink] [raw]
Subject: Re: is killing zombies possible w/o a reboot?

On Thursday 04 November 2004 09:42, [email protected] wrote:
>Selon Gene Heskett <[email protected]>:
>> On Thursday 04 November 2004 08:10, Doug McNaught wrote:
>> >Gene Heskett <[email protected]> writes:
>> >> [root@coyote linux-2.6.10-rc1-bk13]# grep SYSRQ .config
>> >> CONFIG_MAGIC_SYSRQ=y
>> >
>> >Did you also enable it in /proc?
>> >
>> >-Doug
>>
>> I just now discovered it defaults to a 0, so I put an
>> echo 1 >/proc/sys/kernel/sysrq
>> in rc.local just now.
>
>You might also want to have a look at /etc/sysctl.conf. Some distros
> put a kernel.sysrq=0 in it...

And I just put a comment in front of that puppy!

>Cheers,
>Thibaut
>
>> Thanks for the heads up.

--
Cheers, Gene

2004-11-04 16:01:05

by kernel

[permalink] [raw]
Subject: Re: is killing zombies possible w/o a reboot?

On Wed, 2004-11-03 at 11:47, Gene Heskett wrote:
> Finding them is usually an exercise in stretching the
> top window out till it's about 20 screens high as it's always going to
> be at the bottom of the list.

use 'htop' instead, more flexible in showing and parsing.
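
For anyone who would rather script the search than scroll for it, here is a minimal Python sketch, assuming a Linux /proc layout, that lists every process in state Z (zombie) or D (uninterruptible sleep) together with its parent PID. The function name find_by_state is made up for illustration.

```python
import os
from typing import List, Tuple


def find_by_state(states: str = "ZD") -> List[Tuple[int, int, str]]:
    """Return (pid, ppid, state) for every process whose state letter is in `states`."""
    found = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(f"/proc/{entry}/stat") as f:
                data = f.read()
        except OSError:
            continue  # process vanished (or is unreadable) since listdir()
        # The command name is in parentheses and may contain spaces or ')',
        # so split after the LAST ')' instead of splitting on whitespace.
        rest = data.rsplit(")", 1)[1].split()
        state, ppid = rest[0], int(rest[1])
        if state in states:
            found.append((int(entry), ppid, state))
    return found


if __name__ == "__main__":
    for pid, ppid, state in find_by_state():
        print(f"pid={pid} ppid={ppid} state={state}")
```

The ppid column is the useful part for zombies: it names the process that has to call wait(), or be killed so that init inherits the zombie.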


-fd



2004-11-04 16:18:42

by Gene Heskett

[permalink] [raw]
Subject: Re: is killing zombies possible w/o a reboot?

On Thursday 04 November 2004 11:01, kernel wrote:
>On Wed, 2004-11-03 at 11:47, Gene Heskett wrote:
>> Finding them is usually an exercise in stretching the
>> top window out till it's about 20 screens high as it's always going
>> to be at the bottom of the list.
>
>use 'htop' instead, more flexible in showing and parsing.
>
And where is htop? It apparently isn't part of an FC2 install.
>
>-fd

--
Cheers, Gene

2004-11-04 16:30:39

by Pedro Venda

[permalink] [raw]
Subject: Re: is killing zombies possible w/o a reboot?

Jim Nelson wrote:
> DervishD wrote:
>
>> Hi Gene :)
>>
>> * Gene Heskett <[email protected]> dixit:
>>
>>>> Then the children are reparented to 'init' and 'init' gets rid
>>>> of them. That's the way UNIX behaves.
>>>
>>>
>>> Unforch, I've *never* had it work that way. Any dead process I've
>>> ever had while running linux has only been disposable by a reboot.
>>
>>
>>
>> Well, you know, shit happens... Anyway, could you define 'dead'?
>> Because if you're talking about zombies whose parent dies, they're
>> easily killable: just wait until init reaps them (usually less
>> than 5 minutes after they die). If you are talking about zombies whose
>> parent is still alive, then it's a bug in the application, not the
>> kernel. In fact I wouldn't like it if the kernel reaped my children
>> before I do, just in case I want to do something.
>>
>> If you're talking about unkillable processes (those stuck in
>> disk-sleep state), you're right: only rebooting can kill them
>> (although sometimes they go out of D state and die normally). Bad
>> luck for you if any dead process you've ever had while running linux
>> has been of this kind :(
>>
>
> I did this to myself a number of times when I was first learning Samba -
> even an ls would become unkillable. You couldn't rmmod smb, since it
> was in use, and you couldn't kill the process, since it was waiting on a
> syscall. Ergh.

The exact same thing happened to me, but in my case it was with ntfs. zip
processes just got stuck in "D" state because of some unhandled names... I
couldn't kill the processes. I don't think this is an easy thing to do,
though it should be possible to kill -9 these processes and make them exit.

Is this feasible?

regards,
pedro venda.
--

Pedro João Lopes Venda
email: [email protected]
http://maxwell.rnl.ist.utl.pt

Equipa de Administração de Sistemas
Rede das Novas Licenciaturas (RNL)
Instituto Superior Técnico
http://www.rnl.ist.utl.pt
http://mega.ist.utl.pt

2004-11-04 16:46:07

by kernel

[permalink] [raw]
Subject: Re: is killing zombies possible w/o a reboot?

On Thu, 2004-11-04 at 11:18, Gene Heskett wrote:

> And where is htop, it apparently isn't part of an FC2 install.
> >


http://htop.sourceforge.net/

From the site above:
Comparison between htop and top
* In 'htop' you can scroll the list vertically and horizontally to
see all processes and complete command lines.
* In 'top' you are subject to a delay for each unassigned key you
press (especially annoying when multi-key escape sequences are
triggered by accident).
* 'htop' starts faster ('top' seems to collect data for a while
before displaying anything).
* In 'htop' you don't need to type the process number to kill a
process, in 'top' you do.
* In 'htop' you don't need to type the process number or the
priority value to renice a process, in 'top' you do.
* 'htop' supports mouse operation, 'top' doesn't
* 'top' is older, hence, more used and tested.



cheers!

-fd

2004-11-04 17:20:27

by Alex Bennee

[permalink] [raw]
Subject: Re: is killing zombies possible w/o a reboot?

On Thu, 2004-11-04 at 10:04, Helge Hafting wrote:
> Russell Miller wrote:
> >On Wednesday 03 November 2004 16:15, Jim Nelson wrote:
> >
> >Anyway, is there a way to simply signal a syscall that it is to be interrupted
> >and forcibly cause the syscall to end?
> >
> There is a way. Processes going into D state happens all the time
> when waiting for disk io or similar. Then the io happens a few ms later,
> and the fs or device driver tells the kernel to wake up the process
> so it gets a chance at the next scheduling opportunity. So the mechanism to
> unstick a process exists, and is used by every device driver that
> uses sleeping. Which is most of them.
>
> Breakage happens when something never comes out of D state.
> One could write a trivial syscall (or addition to "kill") that "wakes"
> processes waiting for io. It isn't hard to do at all - just copy the
> waking code from any device driver. This would allow one to kill and
> fully remove any process that hangs around in D state. This might
> also release other stuck resources as the syscall
> continues, returns to userspace, and allows the process to die.
>
> Unfortunately, this isn't enough. In some cases the syscall
> expects the io device interrupt handler to have done something
> vital - but this hasn't happened when we forcibly wake a process.
> We can hope for an io error, but might get a crash instead. This
> can be fixed with a lot of work - basically check at every wakeup
> whether the process was woken by this new killing mechanism and
> act accordingly. It shouldn't be hard, but it is _lots_ of work
> inspecting every sleeping point, at least in every device driver.

Timeouts and interruptible sleeps are the two ways to solve the problem.
All good drivers should have covering timeouts in case the event they
were hoping for never happens.

If the code path that assumes magic has happened after it wakes up
doesn't check, it's not defensive enough. Also you can make tasks
interruptible so signals can get through:

    result = wait_event_interruptible(dev->waitq, dev_irq_event(dev));

    /* Non-zero means a signal arrived before the condition became true. */
    if (result) {
            printk(KERN_ALERT "dev_irq_wait: Interrupted by a signal\n");
            return -ERESTARTSYS;
    }

As you have noted you can't always make things interruptible, but decent
timeouts should always exist. Hardware has bugs too!
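
The "covering timeout" idea is not kernel-specific. As a loose userspace analogy only (this is Python threading, not the kernel wait-queue API, and wait_for_device is a made-up name), the sketch below waits for an event but gives up after a deadline, so the caller can report an error instead of sleeping forever when the expected wakeup never arrives.

```python
import threading


def wait_for_device(event: threading.Event, timeout_s: float) -> str:
    """Wait for `event`, but give up after `timeout_s` seconds instead of hanging."""
    if event.wait(timeout_s):
        return "ok"       # the wakeup we were hoping for did arrive
    return "timeout"      # it never came: fail the request rather than block forever


if __name__ == "__main__":
    never_signalled = threading.Event()   # stands in for hardware that lost an interrupt
    print(wait_for_device(never_signalled, 0.1))
```

The design point is the same as in the driver case: the waiter, not the thing it waits for, owns the guarantee that the wait ends.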
--
Alex, Kernel Hacker: http://www.bennee.com/~alex/

In English, every word can be verbed. Would that it were so in our
programming languages.

2004-11-04 18:02:42

by Gene Heskett

[permalink] [raw]
Subject: Re: is killing zombies possible w/o a reboot?

On Thursday 04 November 2004 11:47, kernel wrote:
>On Thu, 2004-11-04 at 11:18, Gene Heskett wrote:
>> And where is htop, it apparently isn't part of an FC2 install.
>
>http://htop.sourceforge.net/
>
Thanks, got it. Looks good, more thanks...

--
Cheers, Gene

2004-11-04 18:25:02

by DervishD

[permalink] [raw]
Subject: Re: is killing zombies possible w/o a reboot?

Hi Paul :)

* Paul Slootman <[email protected]> dixit:
> > If init is the parent, all works ok, just wait a bit and all
> >those zombies will really die ;)
> I recently had a system with serial console where for some reason the
> serial port was stopped. This meant that init blocked while writing some
> message (e.g. "respawning too rapidly"), and that meant it stopped
> reaping those zombie processes. The list of these zombie processes with
> PPID == 1 was amazing. The only thing that helped was rebooting after
> replacing the serial console cable.

It looks like a bug in sysvinit: it shouldn't print anything on
the console but use syslog and specify that the console NEVER shall
be used to print anything even when there is no syslogd running. I'll
make sure that it doesn't happen in my VCinit.

Thanks for the information :)

Raúl Núñez de Arenas Coronado

--
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/

2004-11-04 19:23:54

by Bill Davidsen

[permalink] [raw]
Subject: Re: is killing zombies possible w/o a reboot?

DervishD wrote:
> Hi Bill :)
>
> * Bill Davidsen <[email protected]> dixit:
>
>>> I think that the parent (which is whatever process did the fork
>>>when you clicked your mouse) is still alive and forgetting to do the
>>>'wait()' for its children.
>>
>>It would be good to know what the PPID is, from ps or similar. Things
>>from X are a pain, the parent is often something you don't want to kill.
>>Sometimes you can reparent from command line, "bash -c foo&" or similar,
>>so the parent can be killed without logging out.
>
>
> Just use ps to reveal the family tree. Is not that hard ;)

That's what I just said, the original poster should tell us what the
PPID is, which may help someone help the OP.
>
>
>>I would swear that the parent *is* init in some cases, which is puzzling
>>since they should be reaped.
>
>
> But that's OK :))) When a parent dies without waiting for its
> children, the zombies are reparented to init. That's correct. Then
> init will wait for them. The problem is that sometimes the signals
> don't arrive or the like. Then the zombies lie around for a bit,
> until a timer in 'init' reaps them. That's correct too: init can only
> wait for children when it receives SIGCHLD or periodically, using a
> timer. I've written an init program and that's the way I do it, just
> in case some signal gets lost.
>
> If init is the parent, all works ok, just wait a bit and all
> those zombies will really die ;)

Actually the ones in i/o probably won't, since the kernel either missed
the completion or didn't time out if the hardware missed sending the
int. And even plain non-i/o zombies, just how long "a bit" are you
proposing?

Over Thanksgiving weekend I will try to look at the init code and see if
a signal could be used to initiate a forced reap without waiting for the
timer. By "look at" I mean not only "could I do that" but is it a good
thing to do, before someone starts trying to explain that it's going to
do something evil not to wait for the timer...
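
Whatever triggers it (SIGCHLD, a periodic timer, or the extra signal being considered here), the reaping step itself comes down to a non-blocking wait loop. The following Python sketch shows that loop under stated assumptions: it is not sysvinit's code, and reap_all is a made-up name.

```python
import os
from typing import List


def reap_all() -> List[int]:
    """Reap every child that has already exited, without blocking."""
    reaped = []
    while True:
        try:
            # -1 means "any child"; WNOHANG returns at once if none has exited.
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break        # ECHILD: this process has no children at all
        if pid == 0:
            break        # children exist, but none of them has exited yet
        reaped.append(pid)
    return reaped


if __name__ == "__main__":
    print(reap_all())
```

Because SIGCHLD is not queued per child, one delivery can stand for several exits, which is why the loop keeps calling waitpid() until nothing is left rather than reaping once per signal.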

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2004-11-04 19:27:46

by Bill Davidsen

[permalink] [raw]
Subject: Re: is killing zombies possible w/o a reboot?

Gene Heskett wrote:
> On Wednesday 03 November 2004 14:33, [email protected] wrote:
>
>>On Wed, 03 Nov 2004 14:26:23 EST, Gene Heskett said:
>>
>>>Well, since the "device", a bt878 based Hauppauge tv card is
>>>sitting in a pci socket, that's even more drastic than a reboot.
>>
>>Not if you have a good hot-swap PCI cage. ;)
>>
>>Anyhow, that points even more at a driver issue for the bt878 -
>>if you can get Sysrq-T output, where does it say the hung process is
>>inside the kernel?
>
>
> That's another thing I've had compiled in since forever, but it so
> seldom actually *works*, I've tended to forget about it.
>
You have it enabled as well as compiled in, I'm sure.

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2004-11-04 19:33:16

by Bill Davidsen

[permalink] [raw]
Subject: Re: is killing zombies possible w/o a reboot?

DervishD wrote:
> Hi Bill :)
>
> * Bill Davidsen <[email protected]> dixit:
>
>>> Or write a little program that just 'wait()'s for the specified
>>>PID's. That is perfectly portable IMHO. But I must admit that the
>>>preferred way should be killing the parent. 'init' will reap the
>>>children after that.
>>
>>You can't wait() for the process, you have to use waitfor(), and the
>>last time I tried that it didn't work, although I don't remember the
>>symptom beyond that.
>
>
> You can't wait for other's children. OTOH, if we talk about your
> children, you can do wait() or waitpid() (I assume that you referred
> to waitpid(), since there isn't waitfor() AFAIK). The only difference
> is that wait suspends the process until information from a child is
> available.

Yes, thank you, I was thinking "wait for the PID" and typed that.
>
> If you are talking about others' children, then your call to
> waitpid() (or wait()) failed with ECHILD: not your child.

That's what happened when I tried it a few months ago. I suppose one
could try sending a SIGCHLD to the parent and see if it does something
helpful.
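
The ECHILD behaviour is easy to reproduce. In this minimal Python sketch (try_reap is a made-up helper name), a non-blocking waitpid() on a PID that is not our child fails with ECHILD, which matches what is described above.

```python
import errno
import os
from typing import Optional


def try_reap(pid: int) -> Optional[int]:
    """Try a non-blocking waitpid(); return None on success, else the errno."""
    try:
        os.waitpid(pid, os.WNOHANG)
        return None
    except OSError as exc:
        return exc.errno


if __name__ == "__main__":
    # Our own PID can never be one of our children, so this fails with ECHILD.
    err = try_reap(os.getpid())
    print(errno.errorcode.get(err, err))
```

Only the zombie's real parent can collect it, which is why the practical options in this thread are to get the parent to call wait() or to kill the parent so init inherits the child.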

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2004-11-04 20:11:01

by Bill Davidsen

[permalink] [raw]
Subject: Re: is killing zombies possible w/o a reboot?

Russell Miller wrote:
> On Wednesday 03 November 2004 17:03, Doug McNaught wrote:
>
>
>>It was already mentioned in this thread that the bookkeeping required
>>to clean up properly from such an abort would add a lot of overhead
>>and slow down the normal, non-buggy case.
>>
>
> I am going to continue pursuing this at the risk of making a bigger fool of
> myself than I already am, but I want to make sure that I understand the
> issues - and I did read the message you are referring to.
>
> I think what you are saying is that there is kind of a race condition here.

At least in the usual sense, no. There is a condition from which there
is no graceful way back, only forward.

> When something is on the wait queue, it has to be followed through to
> completion. An interrupt could be received at any time, and if it's taken
> off of the wait queue prematurely, it'll crash the kernel, because the
> interrupt has no way of telling that.

That's part of it, but in some cases there's also i/o in progress, the
hardware may not have a way to HALT the transfer, so the memory in
question can't be used for something else.
>
> That's fine as it goes, I understand that. But I submit that this is a
> horrible design. I've been bitten by this more than once - usually regarding
> broken NFS connections.
>
> But what I don't understand is why the bookkeeping would be so inefficient.
> It seems to me that all that would be required is a bitfield of some sort.
> If that position in the wait queue becomes invalid, when the interrupt is
> received to process it, the kernel notes that a flag is set invalidating that
> part of the wait queue, dumps the output to /dev/null, and goes on to the
> next. This doesn't seem inefficient to me, unless I'm missing something.
> A little more inefficient, yes, but nowhere near the cost that seems to be
> implied.
>
> And I also have to ask this question: what is more inefficient, slowing down
> processing of output waiting on the queue, or having to reboot when a process
> gets stuck due to faulty drivers? At the very least, a compile option seems
> like it would be worthwhile for those that would like this behavior.
>
> And I probably am. Missing something, that is.

You are asking to program around a problem rather than fix it. These
hangs (usually) happen because the hardware behaviour is either
undocumented, incorrectly documented, or flat out broken. Second likely
cause is a bug in the driver.

In the case of a real bug, adding code to bypass the error instead of
fixing it is more effort, more complex in most cases, and therefore less
reliable. Where the hardware does something unexpected, the driver needs
to fit the behaviour rather than the spec. And where the hardware is
broken, you fix or replace it. None of those cases suggest "pretend it
didn't happen," because in most cases you can't.

What I think you are missing:

Processes hung in D state are the result of real problems, and ignoring
rather than fixing them is like giving a cancer patient a face lift; it
doesn't fix the problem, it just gives you a good looking corpse.

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2004-11-04 20:15:01

by Bill Davidsen

[permalink] [raw]
Subject: Re: is killing zombies possible w/o a reboot?

Mitchell Blank Jr wrote:
> Russell Miller wrote:
>
>>Couldn't ring 1 be used to make
>>sure an errant driver doesn't drop the kernel, at least on x86 machines?
>
>
> Not really -- drivers could still do things like mis-program their associated
> hardware making it do DMA writes all over kernel memory (just as one example)
>
> Basically it'd add a lot of complexity (and inefficiency) without adding
> much real safety.

It would be nice on x86 to run ring 1 for kernel debugging, getting
faults at appropriate points. Sorry, I'm an old MULTICS guy, wish
Honeywell would OS it.

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2004-11-04 20:19:22

by Bill Davidsen

[permalink] [raw]
Subject: Re: is killing zombies possible w/o a reboot?

Gene Heskett wrote:
> On Thursday 04 November 2004 07:12, Jan Knutar wrote:
>
>>On Thursday 04 November 2004 13:57, Gene Heskett wrote:
>>
>>>I've had that turned on since forever Jan, but usually, when it's
>>>hung someplace, it's well and truly hung, and hardware reset
>>>button time.
>>
>>Are you saying that these zombies (or tasks stuck in state D) also
>>make sysrq-T hang, and not list all tasks?
>
>
> I thought I'd test it right now while the system is runnng normally,
> but I got only a beep from the console, so I went to
> Documentation/sysrq.txt to make sure I was doing it right, and it is
> _not_ working right now. But it is compiled in according to a make
> xconfig, or a grep of the .config.
>
> [root@coyote linux-2.6.10-rc1-bk13]# grep SYSRQ .config
> CONFIG_MAGIC_SYSRQ=y
>
> I get a couple of beeps from the console, but thats the limit of the
> response, and a tail -f on the log shows nothing. I also logged into
> VC2, and tried it there, but that attempt didn't even get me a beep,
> several times.
>
> The keyboard is a cheap ($24) M$ with a few extra buttons that don't
> do anything along the top. And getting a bit creaky in its old age,
> a lot like me, but I'm about 68 years older than the keyboard :)
>
You don't need to log in, but you do need two hands to hit all the keys
at once ;-) It works for me on a VC when the system isn't hung, but I
agree: when the system is well and truly hung, reset is the only thing left.
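
One possible reason for getting only a beep, offered here as an assumption rather than a diagnosis, is that SysRq is compiled in but switched off at run time through the `kernel.sysrq` sysctl. A minimal Python sketch of checking that setting and of requesting the same task dump without the keyboard follows; the function names are made up for the illustration, and writing to `/proc/sysrq-trigger` needs root.

```python
def sysrq_setting(path="/proc/sys/kernel/sysrq"):
    """Return the run-time SysRq setting as an int, or None if it cannot be read.

    0 means the key combination is disabled; a non-zero value means it is
    enabled (on later kernels the value may be a bitmask of allowed functions).
    """
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def request_task_dump(trigger="/proc/sysrq-trigger"):
    """Ask the kernel to log all tasks, like pressing SysRq-T. Needs root."""
    with open(trigger, "w") as f:
        f.write("t")


if __name__ == "__main__":
    setting = sysrq_setting()
    if setting is None:
        print("SysRq setting not readable on this system")
    elif setting == 0:
        print("SysRq is compiled in but disabled at run time")
    else:
        print(f"SysRq is enabled (kernel.sysrq = {setting})")
```

The task dump lands in the kernel log, so `dmesg` or the syslog is the place to look afterwards.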


--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2004-11-04 20:54:00

by DervishD

Subject: Re: is killing zombies possible w/o a reboot?

Hi Bill :)

* Bill Davidsen <[email protected]> dixit:
> > If init is the parent, all works ok, just wait a bit and all
> >those zombies will really die ;)
> Actually the ones in i/o probably won't, since the kernel either missed
> the completion or didn't time out if the hardware missed sending the
> int. And even plain non-i/o zombies, just how long "a bit" are you
> proposing?

    A zombie *is already dead*, not stuck in some uninterruptible
queue in the kernel, so it will be reaped, sure. My last sentence
in the paragraph above may be confusing: when I said 'really die' I
meant 'be reaped'.

> Over Thanksgiving weekend I will try to look at the init code and see if
> a signal could be used to initiate a forced reap without waiting for the
> timer. By "look at" I mean not only "could I do that" but is it a good
> thing to do, before someone starts trying to explain that it's going to
> do something evil not to wait for the timer...

Don't look: just send SIGCHLD to init. That will do.
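
To illustrate the point that a zombie is only a leftover process-table entry waiting to be reaped, here is a minimal Python sketch (an editorial illustration, not code from the thread). It assumes Linux with /proc mounted; the helper names `proc_state`, `make_zombie`, and `reap` are invented for the example.

```python
import os


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except OSError:
        return None
    # The command name sits in parentheses and may contain spaces,
    # so take the first field after the last ')'.
    return data.rsplit(")", 1)[1].split()[0]


def make_zombie():
    """Fork a child that exits at once; until the parent waits, it is a zombie."""
    pid = os.fork()
    if pid == 0:
        os._exit(7)  # child: exit immediately with status 7
    return pid


def reap(pid):
    """Collect the child's exit status, which removes the zombie entry."""
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

Between `make_zombie()` and `reap()`, once the child has actually exited, `ps` shows it in state Z; after `reap()` the PID is gone. No signal sent to the zombie itself changes that, because there is nothing left to run.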

    Raúl Núñez de Arenas Coronado

--
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/

2004-11-04 21:10:23

by DervishD

Subject: Re: is killing zombies possible w/o a reboot?

Hi Bill :)

* Bill Davidsen <[email protected]> dixit:
> > If you are talking about others' children, then your call to
> >waitpid() (or wait()) failed with ECHILD: not your child.
> That's what happened when I tried it a few months ago. I suppose one
> could try sending a SIGCHLD to the parent and see if it does something
> helpful.

    It probably won't help. If the zombies are there due to a signal
delivery problem, sending a SIGCHLD to the parent will (probably)
solve the problem. But the common case is that the parent is screwed
up or simply so badly programmed that the only way of getting rid of
the zombies is to kill the parent...

Anyway I suppose that sending the SIGCHLD won't do any harm so it
may be worth trying.
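
The ECHILD behaviour mentioned in the quoted text can be shown with a short Python sketch. It is only an illustration; `try_reap` is a hypothetical helper name, and it is meant for a single positive PID.

```python
import os


def try_reap(pid):
    """Attempt a non-blocking waitpid() on one specific PID.

    Returns "reaped", "still running", or "not our child".
    """
    try:
        waited, _ = os.waitpid(pid, os.WNOHANG)
    except ChildProcessError:  # errno ECHILD: pid is not a child of ours
        return "not our child"
    return "reaped" if waited == pid else "still running"
```

Calling it on a zombie that belongs to some other process returns "not our child", which is why only the parent (or init, after adoption) can clean the entry up.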

    Raúl Núñez de Arenas Coronado

--
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/

2004-11-04 22:39:00

by Peter Chubb

Subject: Re: is killing zombies possible w/o a reboot?

>>>>> "Matthias" == Matthias Andree <[email protected]> writes:

Matthias> On Wed, 03 Nov 2004, Gene Heskett wrote:
>> >Yes it does - the problem is that not all resources are managed
>> >by processes. Some allocations are managed by drivers, so a
>> driver >bug can get the device into a unuseable state _and_ tie up
>> the >process(es) that were using the driver at the moment.
>>
>> This from my viewpoint, is wrong. The kernel, and only the kernel
>> should be ultimately responsible for handing out resources, and
>> reclaiming at its convienience.

Matthias> Linux's driver model is the way it is. If you want the
Matthias> kernel to clean up after a driver has puked, you need
Matthias> something like a microkernel I believe, where only a minimal
Matthias> core kernel is a real kernel and where all the drivers are
Matthias> actually in user-space, but that's no longer Linux then.

Matthias> I'm not reflecting the down- and upsides to of this as I
Matthias> have no experience with microkernels (and have never used
Matthias> OS9 or GNU Hurd either). I know there have been attempts to
Matthias> port Linux to a Microkernel but I don't know what's come out
Matthias> of it.

There are actually several ports of Linux onto microkernels, but the
only one I know anything about is the Wombat project here at UNSW.

Linux running on the L4 microkernel runs at around the same speed as
on the bare metal. The home page is at
http://www.disy.cse.unsw.edu.au/Software/Wombat/ but there's not much
there yet.

--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*

2004-11-04 23:06:32

by Helge Hafting

Subject: Re: is killing zombies possible w/o a reboot?

On Thu, Nov 04, 2004 at 04:30:47PM +0000, Pedro Venda (SYSADM) wrote:
> Jim Nelson wrote:
> >DervishD wrote:
>
> the exact same happened to me, but my case was with ntfs. zip processes
> just got stuch in "D" state because of some unhandled names... i
> couldn't kill the processes. i don't think this is an easy thing to do,
> tough it should be possible to kill -9 these processes and make them exit.
>
> is this feasible?
>
The correct approach here is to fix ntfs so it doesn't make processes
wait forever for anything. There is no need for a workaround.

Helge Hafting

2004-11-04 23:33:56

by Benno

Subject: Re: is killing zombies possible w/o a reboot?

On Thu Nov 04, 2004 at 11:07:49 +0100, Matthias Andree wrote:
>On Wed, 03 Nov 2004, Gene Heskett wrote:
>
>> >Yes it does - the problem is that not all resources are managed
>> >by processes. Some allocations are managed by drivers, so a driver
>> >bug can get the device into a unuseable state _and_ tie up the
>> >process(es) that were using the driver at the moment.
>>
>> This from my viewpoint, is wrong. The kernel, and only the kernel
>> should be ultimately responsible for handing out resources, and
>> reclaiming at its convienience.
>
>Linux's driver model is the way it is. If you want the kernel to clean
>up after a driver has puked, you need something like a microkernel I
>believe, where only a minimal core kernel is a real kernel and where all
>the drivers are actually in user-space, but that's no longer Linux then.

Of course, some drivers are already in user-space on Linux (e.g., the X
graphics card drivers). Work by the Gelato project has added support to the
Linux kernel to allow more complicated drivers (e.g., those requiring
interrupts) to be run outside the kernel on Linux.

http://www.gelato.unsw.edu.au/cgi-bin/viewcvs.cgi/cvs/kernel/usrdrivers/

Cheers,

Benno

2004-11-05 00:29:45

by Gene Heskett

Subject: Re: is killing zombies possible w/o a reboot?

On Wednesday 03 November 2004 15:48, Tom Felker wrote:
[...]
>> Isn't there some way to clean up a &^$#^#@)_ zombie?
>
>Ok, let me try to explain what probably happened.
>
>First, terminology. When one process wants to be come two
> processes, it fork()s. One process is the parent, and one it the
> child. The child usually exec()s to become a different program.
> The parent sometimes wants to know when the child ends and whether
> it succeeded. Thus, the wait() system calls. The parent can either
> check whether a child died, or go to sleep until one does. When
> the parent is awaken, it's told which child died and what the
> child's exit status was (usually 0 for success). But if the child
> dies before the parent wait()s, the kernel must keep a record of
> which child died and what its exit status was, and it can't
> reassign the late child's PID yet. This record is a "zombie," and
> shows up under top or ps with the 'Z' state. Zombies do _not_ hold
> open files, memory, or resources of any kind.
>
>That's the technical definition of a zombie, which I'm telling you
> because that's probably not your situation: I assume you used
> "zombie" as an informal term for a process that you can't kill.
> Your problem is a process in uninterruptible sleep (the "D" state).
>
>When a process executing in userspace wants information from a
> device, like a disk or TV capture card, it calls read(), and
> context switches into kernel space. Usually, it will take a moment
> for the data to be available from the device, so the process gets
> put on a wait queue so other processes can run. Obviously nothing
> is deallocated, because everyone expects the process will get it's
> data and proceed as normal. When the device has the data, it
> interrupts the CPU, and the kernel figures out who wanted the data
> and puts them on the run queue.
>
>When a process is on a wait queue waiting for data from a device
> (the D state), it's impossible to kill. This is because otherwise,
> when the interrupt did come, the structures associated with the
> process would have been freed, and the kernel would crash. It
> would require an incredible amount of innefficient bookkeeping to
> avoid this, and it's unnecessary because normally, the data request
> will finish (successfully or not), and the process will be woken
> up, or if it was sent SIGKILL, it will be killed.
>
>Long story short, what happened was, some faulty hardware or some
> buggy driver, probably associated with the capture card, had a
> problem and left the process in D state. Thus, it couldn't be
> killed, and since it had /dev/video open, tvtime couldn't run and
> failed gracefully, and because it held /dev/dsp open, and couldn't
> be killed as the init scripts would normally do in that situation,
> the audio drivers couldn't be unloaded and the boot process hung.
>
>So give us a bunch of information about what hardware you're using,
> output of dmesg, and steps to reproduce the driver bug (if it is
> that).
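
The distinction drawn in the quoted explanation, Z (zombie) versus D (uninterruptible sleep), can be checked from user space. The following Python sketch lists processes in either state together with their parent PID; it assumes Linux with /proc mounted, and `parse_stat` and `stuck_processes` are names made up for this illustration.

```python
import os


def parse_stat(text):
    """Split one /proc/<pid>/stat line into (pid, comm, state, ppid)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' and the first '('.
    head, tail = text.rsplit(")", 1)
    pid, comm = head.split("(", 1)
    fields = tail.split()
    return int(pid), comm, fields[0], int(fields[1])


def stuck_processes(states=("D", "Z")):
    """Return (pid, comm, state, ppid) for every process in one of `states`."""
    found = []
    try:
        entries = os.listdir("/proc")
    except OSError:
        return found  # no /proc available
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                info = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process vanished or line was unreadable
        if info[2] in states:
            found.append(info)
    return found


if __name__ == "__main__":
    for pid, comm, state, ppid in stuck_processes():
        print(f"{pid:>7} {state} parent={ppid} {comm}")
```

For a Z entry the interesting column is the parent, since that is the process that has to wait(); for a D entry the parent is irrelevant and the driver or device is the place to look.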

I cannot do that, as it apparently was a transient thing. After the
reboot to the next kernel in the series, everything has been working
as well as can be expected. I've listened to the radio for about 30
seconds, and watched TV for maybe 6 hours since.
Now that I know how to make the magic sysrq actually work and leave
meaningful stuff in the logs, maybe I can report something that
might be constructive the next time it happens. Until then, I wait
for the other shoe I guess.

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
99.28% setiathome rank, not too shabby for a WV hillbilly
Yahoo.com attorneys please note, additions to this message
by Gene Heskett are:
Copyright 2004 by Maurice Eugene Heskett, all rights reserved.

2004-11-05 02:41:10

by Elladan

Subject: Re: is killing zombies possible w/o a reboot?

On Thu, Nov 04, 2004 at 08:39:34AM +0200, Denis Vlasenko wrote:
> On Thursday 04 November 2004 01:33, Russell Miller wrote:
> > On Wednesday 03 November 2004 17:03, Doug McNaught wrote:
> >
> > > It was already mentioned in this thread that the bookkeeping required
> > > to clean up properly from such an abort would add a lot of overhead
> > > and slow down the normal, non-buggy case.
> > >
> > I am going to continue pursuing this at the risk of making a bigger fool of
> > myself than I already am, but I want to make sure that I understand the
> > issues - and I did read the message you are referring to.
> >
> > I think what you are saying is that there is kind of a race condition here.
> > When something is on the wait queue, it has to be followed through to
> > completion. An interrupt could be received at any time, and if it's taken
> > off of the wait queue prematurely, it'll crash the kernel, because the
> > interrupt has no way of telling that.
>
> The problem is in locking. You must not kill process while it is
> in uninterruptible state because it is uninterruptible
> for a reason - has taken semaphore, or get_cpu(), etc.
> You do want it to do put_cpu(), right?
>
> Processes must never get stuck in D, it's a kernel bug.
>
> Find out how did process ended up in D state forever,
> and fix it - that's what I'm trying to do
> in these cases.

Perhaps it would be useful to add some debugging to the kernel for these
cases, somewhat akin to Ingo's preempt trace stuff?

If a process is in D state and receives a SIGKILL, assume it must exit
within a few seconds or it's a bug, and dump as much information about
it as is practical...?
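
A rough user-space approximation of this idea, not the kernel-side debugging proposed above, might look like the Python sketch below. It sends SIGKILL, waits a few seconds, and if the process is still neither gone nor a zombie it collects whatever /proc exposes. The function names and the 3-second default are assumptions for the illustration; `/proc/<pid>/stack` only exists on newer kernels and usually needs root, so that field may come back as None.

```python
import os
import signal
import time


def read_proc(pid, name):
    """Return the stripped contents of /proc/<pid>/<name>, or None if unreadable."""
    try:
        with open(f"/proc/{pid}/{name}") as f:
            return f.read().strip()
    except OSError:
        return None


def state_of(pid):
    """Return the one-letter process state, or None if the process is gone."""
    stat = read_proc(pid, "stat")
    if stat is None:
        return None
    return stat.rsplit(")", 1)[1].split()[0]


def kill_with_deadline(pid, deadline=3.0, poll=0.1):
    """Send SIGKILL; return None if the process dies in time, else a report dict."""
    os.kill(pid, signal.SIGKILL)
    end = time.monotonic() + deadline
    while time.monotonic() < end:
        if state_of(pid) in (None, "Z"):  # gone, or dead and awaiting reaping
            return None
        time.sleep(poll)
    return {
        "pid": pid,
        "state": state_of(pid),
        "wchan": read_proc(pid, "wchan"),  # kernel symbol it sleeps in, if exposed
        "stack": read_proc(pid, "stack"),  # kernel stack, if available and permitted
    }
```

A non-None result with state "D" is the case discussed in this thread: the signal is pending, but the task never returns from the kernel to act on it.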

-J

2004-11-05 03:12:50

by Tim Connors

Subject: Re: is killing zombies possible w/o a reboot?

Elladan <[email protected]> said on Thu, 4 Nov 2004 18:38:50 -0800:
> If a process is in D state and receives a SIGKILL, assume it must exit
> within a few seconds or it's a bug, and dump as much information about
> it as is practical...?

Of course, it's not necessarily a bug. Someone could have just kicked
the ethernet, and so your process is stuck waiting for a read/write.

--
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
Theoretically one might have been wearing pants at work.
-- Anthony de Boer in Scary Devil Monastry

2004-11-05 03:14:13

by Russell Miller

Subject: Re: is killing zombies possible w/o a reboot?

On Thursday 04 November 2004 21:10, Tim Connors wrote:

> Of course, it's not necessarily a bug. Someone could have just kicked
> the ethernet, and so your process is stuck waiting for a read/write.

But it *is* a process hung in D state after you sent it a kill. It's safe to
assume, at least, that something is screwed up somewhere. More information
is always a good thing.

--Russell

--

Russell Miller - [email protected] - Le Mars, IA
Duskglow Consulting - Helping companies just like you to succeed for ~ 10 yrs.
http://www.duskglow.com - 712-546-5886

2004-11-05 04:41:02

by Elladan

Subject: Re: is killing zombies possible w/o a reboot?

On Fri, Nov 05, 2004 at 02:10:35PM +1100, Tim Connors wrote:
> Elladan <[email protected]> said on Thu, 4 Nov 2004 18:38:50 -0800:
> > If a process is in D state and receives a SIGKILL, assume it must exit
> > within a few seconds or it's a bug, and dump as much information about
> > it as is practical...?
>
> Of course, it's not necessarily a bug. Someone could have just kicked
> the ethernet, and so your process is stuck waiting for a read/write.

Sounds like a bug to me. Kernel resource leak due to network activity?

-J

2004-11-05 05:00:51

by Kyle Moffett

Subject: Re: is killing zombies possible w/o a reboot?

On Nov 04, 2004, at 22:10, Tim Connors wrote:
> Elladan <[email protected]> said on Thu, 4 Nov 2004 18:38:50 -0800:
>> If a process is in D state and receives a SIGKILL, assume it must exit
>> within a few seconds or it's a bug, and dump as much information about
>> it as is practical...?
>
> Of course, it's not necessarily a bug. Someone could have just kicked
> the ethernet, and so your process is stuck waiting for a read/write.

In any case, if a process is sleeping in-kernel, I expect that either it's an
interruptible sleep or a guaranteed-short sleep. If it's neither, it's a bug.
If I kick out an ethernet and it makes "ping" hang in "D", that's bad. I think
that eventually _all_ kernel sleeps on the behalf of user-space processes
will become interruptible.

Cheers,
Kyle Moffett

-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GCM/CS/IT/U d- s++: a17 C++++>$ UB/L/X/*++++(+)>$ P+++(++++)>$
L++++(+++) E W++(+) N+++(++) o? K? w--- O? M++ V? PS+() PE+(-) Y+
PGP+++ t+(+++) 5 X R? tv-(--) b++++(++) DI+ D+ G e->++++$ h!*()>++$ r
!y?(-)
------END GEEK CODE BLOCK------


2004-11-09 23:29:40

by Bill Davidsen

Subject: Re: is killing zombies possible w/o a reboot?

DervishD wrote:
> Hi Bill :)
>
> * Bill Davidsen <[email protected]> dixit:
>
>>> If you are talking about others' children, then your call to
>>>waitpid() (or wait()) failed with ECHILD: not your child.
>>
>>That's what happened when I tried it a few months ago. I suppose one
>>could try sending a SIGCHLD to the parent and see if it does something
>>helpful.
>
>
> Probably it won't do. If the zombies are there due to a signal
> delivery problem, sending a SIGCHLD to the parent will (probably)
> solve the problem. But the common case is that the parent is screwed
> up or simply so badly programmed that the only way of getting rid of
> the zombies is to kill the parent...

Wait a minute, in another message you just suggested that a SIGCHLD to
init would cause the status to be reaped.
>
> Anyway I suppose that sending the SIGCHLD won't do any harm so it
> may be worth trying.

It won't hurt init, but some processes do use the SIGCHLD to trigger a
wait(), which might hang the parent.

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2004-11-10 09:29:44

by DervishD

Subject: Re: is killing zombies possible w/o a reboot?

Hi Bill :)

* Bill Davidsen <[email protected]> dixit:
> > Probably it won't do. If the zombies are there due to a signal
> >delivery problem, sending a SIGCHLD to the parent will (probably)
> >solve the problem. But the common case is that the parent is screwed
> >up or simply so badly programmed that the only way of getting rid of
> >the zombies is to kill the parent...
> Wait a minute, in another message you just suggested that a SIGCHLD to
> init would cause the status to be reaped.

    I don't consider init the parent of such processes. It just
'adopts' them when the real parent doesn't care for them. I was
talking, in the paragraph above, about the *real* parent. I don't see
any contradiction, although sending SIGCHLD to a program that has not
waited for its children is risky: if the programmer was so clueless
that children were not waited for in the first place, chances are
that SIGCHLD handling is damaged, too.

> > Anyway I suppose that sending the SIGCHLD won't do any harm so it
> >may be worth trying.
> It won't hurt init, but some processes do use the SIGCHLD to trigger a
> wait(), which might hang the parent.

    If a parent does 'wait()' instead of 'waitpid()', that's lazy
programming. The signal won't hurt anyway: if the parent blocks (a bug
in the program), then a 'kill -9' is the correct medication (it's
what I use for buggy programs). The children are then reparented to
init and correctly handled (because a good init should, IMHO, use
waitpid() instead of wait()). Let's say that sending SIGCHLD is
'mostly harmless' ;))
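
As an illustration of why a non-blocking waitpid() makes a stray SIGCHLD harmless, here is a minimal Python sketch of a SIGCHLD handler that reaps whatever has exited and never blocks. It is not taken from any real init; `reaped`, `on_sigchld`, and `install_reaper` are names invented for the example.

```python
import os
import signal

reaped = {}  # pid -> raw wait status, filled in by the handler


def on_sigchld(signum, frame):
    """Reap every child that has already exited, without ever blocking."""
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children at all: a spurious SIGCHLD costs nothing
        if pid == 0:
            return  # children exist, but none has exited yet
        reaped[pid] = status


def install_reaper():
    """Register the handler so exited children are collected as they die."""
    signal.signal(signal.SIGCHLD, on_sigchld)
```

A handler that called a plain blocking wait() here would hang on a spurious signal whenever live children exist but none has exited, which is the failure mode raised in the quoted objection.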

    Raúl Núñez de Arenas Coronado

--
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/