[ Feh, forgot to attach the damned file. ]
Folks,
Attached is a module, hangcheck-timer. It is used to detect
when the system goes out to lunch for a period of time, such as when a
driver like qla2x00 udelays a bunch.
The module sets a timer. When the timer goes off, it then uses
the TSC (warning: portability needed) to determine how much real time
has passed.
On a normal system, the real elapsed time will be almost
identical to the expected timer duration. However, if a device decided
to udelay for 60 seconds (or some other circumstance stalled the
system), the module takes notice. If the elapsed time overshoots the
expected duration by more than the margin of error, the machine is
rebooted.
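
The check described above can be sketched in userspace C. This is only
an illustration, not the attached module: the function names, the
explicit cycles_per_sec parameter, and the reading of "margin" as
"allowed overshoot beyond the tick" are all assumptions made for the
example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert whole seconds to counter cycles. */
static uint64_t seconds_to_cycles(uint64_t seconds, uint64_t cycles_per_sec)
{
    return seconds * cycles_per_sec;
}

/*
 * Hypothetical decision function. armed_at and fired_at are readings of
 * a free-running cycle counter (the TSC in the original description),
 * taken when the timer was set and when it went off.
 *
 * Returns 1 if the real elapsed time exceeds tick + margin (assumed
 * interpretation of the margin of error), 0 otherwise.
 */
static int hang_detected(uint64_t armed_at, uint64_t fired_at,
                         uint64_t tick_s, uint64_t margin_s,
                         uint64_t cycles_per_sec)
{
    assert(cycles_per_sec > 0);

    /* Unsigned subtraction stays correct if the counter wrapped once. */
    uint64_t elapsed = fired_at - armed_at;
    uint64_t limit = seconds_to_cycles(tick_s + margin_s, cycles_per_sec);

    return elapsed > limit;
}
```

In a real module the "return 1" case is where the reboot would be
triggered, and the timer would simply be re-armed otherwise.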
The module is currently used in a cluster environment. After
some time out to lunch, the rest of the cluster will have given up on a
machine. If the machine suddenly comes back and assumes it is still
"live", bad things can happen.
We can also see use for this in a debugging sense, for kernel
hangs as well as driver code. That's why I'm proposing it for general
inclusion.
Comments? Thoughts?
Joel
Building:
The module should happily build against most 2.4 kernels. The
usual module building compile line:
    gcc -I /scratch/jlbec/kernel/linux-2.4.20-rc2/include \
        -DMODULE -D__KERNEL__ -DLINUX -c -o hangcheck-timer.o \
        hangcheck-timer.c
Running:
Load the module with insmod. There are two options.
"hangcheck_tick=<seconds>" specifies the timer timeout, and
"hangcheck_margin=<seconds>" specifies the margin of error.
--
"Friends may come and go, but enemies accumulate."
- Thomas Jones
Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: [email protected]
Phone: (650) 506-8127
Joel Becker wrote:
> Folks,
> Attached is a module, hangcheck-timer. It is used to detect
> when the system goes out to lunch for a period of time, such as when a
> driver like qla2x00 udelays a bunch.
There is already an NMI watchdog that is better than what you propose,
because it will also catch cases where something gets stuck with
interrupts disabled.
--
Brian Gerst
On Thu, Nov 21, 2002 at 03:31:04PM -0500, Brian Gerst wrote:
> Joel Becker wrote:
> > Attached is a module, hangcheck-timer. It is used to detect
> >when the system goes out to lunch for a period of time, such as when a
> >driver like qla2x00 udelays a bunch.
>
> There is already an NMI watchdog that is better than what you propose,
> because it will also catch cases where something gets stuck with
> interrupts disabled.
The issue at hand is not permanent hangs. The issue is hangs
that return. Consider a clustering environment where the other nodes
have given up on the delayed node and clean up after it. When the hang
finally ends, the node still thinks it is "alive" and happily scribbles
to places it shouldn't.
udelay will never trigger the NMI watchdog: it is running
on the processor, so the cpu timer keeps ticking happily. But as far as
everything higher up (kernel + userspace) is concerned, the delay goes
unnoticed, and bad things can happen.
Joel
--
"I'm so tired of being tired,
Sure as night will follow day.
Most things I worry about
Never happen anyway."
Joel Becker
On Thu, Nov 21, 2002 at 12:19:31PM -0800, Joel Becker wrote:
> it then uses the TSC (warning: portability needed)
ISTR get_cycles() being around, which should be defined for other arches.
Bill
> [ Feh, forgot to attach the damned file. ]
:-)
> The module is currently used in a cluster environment. After
> some time out to lunch, the rest of the cluster will have given up on a
> machine. If the machine suddenly comes back and assumes it is still
> "live", bad things can happen.
Would it make more sense for the other machines
to "kill" the offending machine (cut power or press reset)?
--
Pavel
Written on sharp zaurus, because my Velo1 broke. If you have Velo you don't need...
On Tue, Nov 26, 2002 at 02:35:47PM +0100, Pavel Machek wrote:
> Would it make more sense for the other machines
> to "kill" the offending machine (cut power or press reset)?
There is no solution that is general and inexpensive. STONITH
is as close as it gets, and we don't have support for that. On other
platforms where the shared disk is on FC, the device driver supports
fencing nodes from the switch.
That said, this module isn't exclusively useful to a
cluster+shared disk environment. If it were, I couldn't see generic
inclusion. This code is useful in many other situations.
Joel
--
Life's Little Instruction Book #313
"Never underestimate the power of love."
Joel Becker