I am working on a Linux-based system and developing a monitoring process
that will do the following:
(1) Detect an abnormally terminated application process and restart
the process group
(2) Detect a hung application and restart it
My question concerns the second point. What is the proper definition
of a "hanged process" in the Linux context? I searched on Google and
found the following definitions:
(1) A process not accepting any signals and consuming system resources
(2) A process in STOP state
(3) A process in deadlock state
A process matching definition 3 results from race conditions or bad
programming. Definition 1 does describe a truly hung process, but is it
possible to create such a process in Linux, given that the kernel
guarantees signal delivery to the process and its handling?
Does anybody have another definition of a "hanged process" in the Linux
context?
Deepak Gaur
On Wed, 04 May 2005 14:38:48 +0900
"Deepak" <[email protected]> wrote:
> I am working on a Linux based system and developing a monitoring process
> which shall do the following function
>
> (1) It will detect abnormally terminated application process and will
> restart the process group
>
> (2) It will detect a hanged application and will restart it
>
> My query is regarding second point . What should be the proper
> definition of a "Hanged Process" in Linux context . I searched on google
> regarding it and got the following definitions
>
> (1) A process not accepting any signals and consuming system resources
> (2) A process in STOP state
> (3) A process in deadlock state
>
> Process conforming to definition 3 will be due to race conditions/bad
> programming.Definition 1 does define a proper hanged process but is it
> possible to create such a process in LInux as in linux signal delivery
> to the process and its handling is assured by the Linux kernel.
>
> Anybody having another definition for a "Hanged process" in Linux
> context
>
> Deepak Gaur
It is impossible to tell with certainty the difference between a very busy
process and one that is hung. Better to build hang detection into the
process itself with a heartbeat interface. Look at the hangcheck timer,
nmi_watchdog, and watchdog devices.
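
A minimal sketch of the heartbeat idea, assuming the monitored process
touches a file from its main loop and the monitor treats a stale timestamp
as a hang. The names beat() and is_hung(), the file path, and the timeout
are illustrative assumptions, not an existing interface; a pipe, a socket,
or a watchdog device could carry the heartbeat instead.

```python
import os
import time


def beat(path):
    """Called by the monitored process from its main loop: 'still alive'."""
    # Create the file if it does not exist yet.
    with open(path, "a"):
        pass
    # Set access and modification time to the current time.
    os.utime(path, None)


def is_hung(path, timeout, now=None):
    """Called by the monitor: True if no beat arrived within `timeout` seconds."""
    if now is None:
        now = time.time()
    try:
        last = os.stat(path).st_mtime
    except FileNotFoundError:
        # No heartbeat was ever written; treat that as hung (an assumption).
        return True
    return (now - last) > timeout
```

The timeout is necessarily arbitrary: it has to be longer than the longest
stretch of legitimate work between two beat() calls.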
--
Stephen Hemminger <[email protected]>
On Wed, 04 May 2005 14:38:48 +0900, Deepak said:
> I am working on a Linux based system and developing a monitoring process
> which shall do the following function
> Anybody having another definition for a "Hanged process" in Linux
> context
Around here, the big issue is usually a process stuck in 'D' state - in
other words, a process that's done a syscall or otherwise entered the
kernel (page faults and AIO being other possibilities) and hasn't returned.
Since signals are delivered at return time, even a 'kill -9' won't do the
desired thing. These are almost always the result of either kernel bugs
or hardware failures.
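
For what it's worth, a monitor can at least notice the 'D' state from
outside by reading /proc/<pid>/stat. The following is only a sketch: the
function names, sample count, and interval are assumptions, and since a
healthy process also passes through 'D' briefly during ordinary disk I/O,
it samples several times rather than trusting a single reading.

```python
import time


def parse_state(stat_line):
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The second field (command name) is wrapped in parentheses and may
    # itself contain spaces or ')' characters, so split after the LAST ')'.
    rest = stat_line.rsplit(")", 1)[1]
    return rest.split()[0]


def process_state(pid):
    """Read the current state of `pid`; 'D' means uninterruptible sleep."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_state(f.read())


def stuck_in_d(pid, samples=5, interval=1.0):
    """True only if every one of `samples` readings shows state 'D'."""
    for i in range(samples):
        try:
            if process_state(pid) != "D":
                return False
        except FileNotFoundError:
            # The process has exited, so it is not stuck.
            return False
        if i < samples - 1:
            time.sleep(interval)
    return True
```

Detecting the state does not help in clearing it, of course.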
There was a lengthy thread a while ago about how to deal with these, and the
consensus was that there's *NO* good general way to un-wedge such a process,
and that fixing the underlying bug or hardware fault is the only way to deal
with it.
On 5/4/05, Deepak <[email protected]> wrote:
> (1) A process not accepting any signals and consuming system resources
wrong. It may have decided to go through a critical section of its job, and
the job takes a long time.
> (2) A process in STOP state
wrong. The process may be under a debugger or may have been stopped
deliberately (killall -STOP bash).
Besides, the RUNNING state is often more suspicious.
> (3) A process in deadlock state
how can you detect this from outside of the process(es)?!
> Process conforming to definition 3 will be due to race conditions/bad
> programming.Definition 1 does define a proper hanged process but is it
> possible to create such a process in LInux as in linux signal delivery
> to the process and its handling is assured by the Linux kernel.
Anything except SIGKILL and SIGSTOP can be overridden.
And it doesn't help you detect runaway processes anyway.
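
A small probe showing the point, assuming a Linux/POSIX system and that it
runs in the main thread (Python only allows signal.signal() there). The
helper name can_override() is made up for illustration.

```python
import signal


def can_override(sig):
    """Return True if this process may change the disposition of `sig`."""
    try:
        old = signal.signal(sig, signal.SIG_IGN)
    except OSError:
        # The kernel rejects attempts to catch or ignore SIGKILL and SIGSTOP.
        return False
    # Restore the previous disposition so the probe has no lasting effect.
    # `old` is None when the previous handler was not installed from Python.
    signal.signal(sig, old if old is not None else signal.SIG_DFL)
    return True
```

So a process that ignores SIGTERM is perfectly legal, and says nothing about
whether it is hung.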
> Anybody having another definition for a "Hanged process" in Linux
> context
Except for the case described by Valdis Kletnieks, it's hard to define.
How do you distinguish between a very busy process and one that is
deadlocked within itself?
You'll probably end up defining some arbitrary timeouts for the
processes under your control, some watchdog interface for the
processes, and plain old kill(..., SIGCONT); kill(..., SIGKILL); restart();
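
Roughly like this, as a sketch only: it assumes the monitor started the
application itself as a child (so it holds a subprocess.Popen object for
it), and force_restart() and wait_timeout are made-up names. Deciding when
to call it is left to whatever timeout or watchdog check you settle on.

```python
import signal
import subprocess


def force_restart(proc, argv, wait_timeout=5.0):
    """Kill `proc` unconditionally, then start a fresh instance of `argv`."""
    if proc.poll() is None:
        # Same order as above: SIGCONT first, then SIGKILL.
        proc.send_signal(signal.SIGCONT)
        proc.send_signal(signal.SIGKILL)
        try:
            # Reap the child so it does not linger as a zombie.
            proc.wait(timeout=wait_timeout)
        except subprocess.TimeoutExpired:
            # Not reaped in time: possibly wedged in 'D' state, where even
            # SIGKILL is not acted on until the kernel call returns.
            pass
    return subprocess.Popen(argv)
```

Usage would be along the lines of proc = force_restart(proc, argv) whenever
the check for that application fails.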