2006-02-21 15:24:03

by David Golombek

Subject: 2.4.31 hangs, no information on console or serial port

I have a box running a modified Debian/woody system and 2.4.31. It is
intermittently hanging such that:

* All logging to /var/log ceases.
* Machine is still pingable.
* Machine accepts telnet connections on the time port, but no time is echoed.
* After attaching a console+keyboard, console would not unblank.
* Nothing responded when attaching a serial console.
* Machine does not respond to Ctrl-Alt-Del
* No DMI messages are logged.
* Hang is persistent until physical reboot.
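
The combination of "pingable" and "connects but stays silent" is what one would expect if the kernel still answers on the network (ICMP replies and TCP handshakes are completed without help from userspace) while no process gets to run. A minimal Python sketch of a remote probe that tells these states apart follows. The function name `probe_tcp_service`, the 64-byte read and the timeout are illustrative assumptions, and port 37 in the usage note is only a guess at which "time port" was meant.

```python
import socket


def probe_tcp_service(host, port, timeout=5.0):
    """Classify a TCP service as 'unreachable', 'silent' or 'responding'."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, timed out or no route: the handshake never completed.
        return "unreachable"
    with sock:
        sock.settimeout(timeout)
        try:
            data = sock.recv(64)
        except socket.timeout:
            # The handshake completed, but no process ever sent any data.
            return "silent"
        except OSError:
            return "unreachable"
    # An empty read means the peer closed without sending anything.
    return "responding" if data else "silent"
```

Run from another machine, for example `probe_tcp_service("192.0.2.10", 37)` (a placeholder address), a result of "silent" on a host that still answers ping would match the symptoms listed above.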

This has happened 4 times, on 2 separate machines (under roughly
similar conditions). Machines are up variable amounts of time before
crashing, between many weeks and less than 1 day. Nothing unusual is
logged in /var/log/{daemon.log,kern.log,messages,syslog} prior to the
hang, except that /var/log/messages includes the "TCP: Treason
uncloaked!" warnings that are fixed in 2.4.32. No users were logged
on at the time of 3 of the 4 crashes, and no local user activity was
present at the time of the 4th.

The machines are Intel P4s with 2GB of memory.

The machine is under relatively high load and has a custom userspace
nfs server running on it (which is potentially to blame, but we've
been unable to determine how). The custom userspace nfs server and
tomcat4 are the primary applications running.

Any suggestions as to how we might debug this or possible causes would
be greatly appreciated.

Thanks,
Dave


2006-02-21 15:34:43

by Benjamin LaHaise

Subject: Re: 2.4.31 hangs, no information on console or serial port

On Tue, Feb 21, 2006 at 10:23:56AM -0500, David Golombek wrote:
> Any suggestions as to how we might debug this or possible causes would
> be greatly appreciated.

Have you tried turning on the NMI watchdog (nmi_watchdog=1)? It should
be able to kick the machine out of the locked state, as these symptoms
would hint at a spinlock deadlock with interrupts disabled. Also, try
to reproduce on the latest 2.4.33pre. That said, for an io intensive
workload like you're running, 2.6 is much better, especially for systems
using highmem.
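
After booting with nmi_watchdog=1, one way to confirm that the watchdog is actually ticking is to sample the NMI line of /proc/interrupts twice and check that the counts grew. A minimal Python sketch, assuming the usual "NMI: count count ..." layout of that file; the helper names `nmi_counts` and `watchdog_ticking` are made up for illustration, and requiring growth on every CPU is an assumption rather than a documented guarantee.

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counters from /proc/interrupts text."""
    for line in interrupts_text.splitlines():
        label, _, rest = line.partition(":")
        if label.strip() == "NMI":
            counts = []
            for token in rest.split():
                if not token.isdigit():
                    break  # newer kernels append a text description
                counts.append(int(token))
            return counts
    return []  # no NMI line found


def watchdog_ticking(before, after):
    """True if every CPU's NMI count increased between two samples."""
    return (
        bool(before)
        and len(before) == len(after)
        and all(b < a for b, a in zip(before, after))
    )
```

In use, one would read /proc/interrupts, sleep for a few seconds, read it again, and pass the two `nmi_counts` results to `watchdog_ticking`. Counts that stay at zero suggest the watchdog is not active on that hardware.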

-ben
--
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here
and they've asked us to stop the party." Don't Email: <[email protected]>.

2006-02-21 16:05:05

by David Golombek

Subject: Re: 2.4.31 hangs, no information on console or serial port

Benjamin LaHaise <[email protected]> writes:
> On Tue, Feb 21, 2006 at 10:23:56AM -0500, David Golombek wrote:
> > Any suggestions as to how we might debug this or possible causes would
> > be greatly appreciated.
>
> Have you tried turning on the NMI watchdog (nmi_watchdog=1)? It
> should be able to kick the machine out of the locked state, as these
> symptoms would hint at a spinlock deadlock with interrupts disabled.
> Also, try to reproduce on the latest 2.4.33pre. That said, for an
> io intensive workload like you're running, 2.6 is much better,
> especially for systems using highmem.

I'll enable nmi_watchdog as soon as we can bring the machine down,
thanks for the excellent suggestion. I'd entirely forgotten about the
watchdog. I'll try to switch to 2.4.33pre as soon as possible; it
certainly has several fixes we've been waiting for. 2.6 is still a
ways off, lots of qualification work to do.

Thanks,
Dave

2006-02-21 21:46:20

by Willy Tarreau

Subject: Re: 2.4.31 hangs, no information on console or serial port

On Tue, Feb 21, 2006 at 11:04:57AM -0500, David Golombek wrote:
> Benjamin LaHaise <[email protected]> writes:
> > On Tue, Feb 21, 2006 at 10:23:56AM -0500, David Golombek wrote:
> > > Any suggestions as to how we might debug this or possible causes would
> > > be greatly appreciated.
> >
> > Have you tried turning on the NMI watchdog (nmi_watchdog=1)? It
> > should be able to kick the machine out of the locked state, as these
> > symptoms would hint at a spinlock deadlock with interrupts disabled.
> > Also, try to reproduce on the latest 2.4.33pre. That said, for an
> > io intensive workload like you're running, 2.6 is much better,
> > especially for systems using highmem.
>
> I'll enable nmi_watchdog as soon as we can bring the machine down,
> thanks for the excellent suggestion. I'd entirely forgotten about the
> watchdog. I'll try to switch to 2.4.33pre as soon as possible; it
> certainly has several fixes we've been waiting for. 2.6 is still a
> ways off, lots of qualification work to do.

BTW, if your console blanks, you should use this :

# setterm -blank 0

Maybe you'll notice some "OOM: killing process" messages indicating
that some hungry process is going mad (possibly the NFS server).
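
If the messages scroll past or end up in a log file instead, they can be searched for afterwards. A minimal Python sketch, assuming the kernel's wording contains either "OOM" or "Out of Memory" (the exact text differs between kernel versions, so the pattern is deliberately loose); `find_oom_lines` is a hypothetical helper name.

```python
import re

# Loose pattern: matches "OOM: killing process ..." as well as
# "Out of Memory: Killed process ..." style messages.
OOM_PATTERN = re.compile(r"out of memory|\boom\b", re.IGNORECASE)


def find_oom_lines(lines):
    """Return the log lines that look like OOM-killer messages."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]
```

For example, `find_oom_lines(open("/var/log/kern.log"))` would list candidate lines from the kernel log, assuming that is where the messages are written on the machine in question.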

> Thanks,
> Dave

Regards,
Willy

2006-02-27 16:24:22

by David Golombek

Subject: Re: 2.4.31 hangs, no information on console or serial port

> On Tue, Feb 21, 2006 at 10:23:56AM -0500, David Golombek wrote:
> > I have a box running a modified Debian/woody system and 2.4.31. It is
> > intermittently hanging such that:
> >
> > * All logging to /var/log ceases.
> > * Machine is still pingable.
> > * Machine accepts telnet connections on the time port, but no time is echoed.
> > * After attaching a console+keyboard, console would not unblank.
> > * Nothing responded when attaching a serial console.
> > * Machine does not respond to Ctrl-Alt-Del
> > * No DMI messages are logged.
> > * Hang is persistent until physical reboot.
> >
> > This has happened 4 times, on 2 separate machines (under roughly
> > similar conditions). Machines are up variable amounts of time before
> > crashing, between many weeks and less than 1 day. Nothing unusual is
> > logged in /var/log/{daemon.log,kern.log,messages,syslog} prior to the
> > hang, except that /var/log/messages includes the "TCP: Treason
> > uncloaked!" warnings that are fixed in 2.4.32. No users were logged
> > on at the time of 3 of the 4 crashes, and no local user activity was
> > present at the time of the 4th.
> >
> > The machines are Intel P4s with 2GB of memory.
> >
> > The machine is under relatively high load and has a custom userspace
> > nfs server running on it (which is potentially to blame, but we've
> > been unable to determine how). The custom userspace nfs server and
> > tomcat4 are the primary applications running.
> >
> > Any suggestions as to how we might debug this or possible causes would
> > be greatly appreciated.
>
> Benjamin LaHaise <[email protected]> writes:
> Have you tried turning on the NMI watchdog (nmi_watchdog=1)? It
> should be able to kick the machine out of the locked state, as these
> symptoms would hint at a spinlock deadlock with interrupts disabled.
> Also, try to reproduce on the latest 2.4.33pre. That said, for an
> io intensive workload like you're running, 2.6 is much better,
> especially for systems using highmem.

After a week of intensive testing, we were finally able to reproduce
this hang. Sadly, the nmi watchdog did not appear to trigger (I'm
pretty sure it was configured correctly, I did see NMIs occurring).
No information appeared on serial or console (although this time they
weren't blanked). We're building a 2.4.33pre kernel now to test
whether the hang still reproduces on it.

We're beginning to suspect that a hung loopback NFS mount might be to
blame, although we can't reproduce this trivially. Is there any way in
which a mount that was behaving badly could affect the kernel in this
manner?

Dave

2006-02-27 16:45:33

by Benjamin LaHaise

Subject: Re: 2.4.31 hangs, no information on console or serial port

On Mon, Feb 27, 2006 at 11:24:10AM -0500, David Golombek wrote:
> We're beginning to suspect that a hung loopback NFS mount might be to
> blame, although we can't reproduce this trivially. Is there any way in
> which a mount that was behaving badly could affect the kernel in this
> manner?

Loopback NFS can deadlock in trying to free memory when writing back dirty
pages. Use mount --bind instead.

-ben
--
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here
and they've asked us to stop the party." Don't Email: <[email protected]>.

2006-02-27 17:48:14

by David Golombek

Subject: Re: 2.4.31 hangs, no information on console or serial port

Benjamin LaHaise <[email protected]> writes:
> On Mon, Feb 27, 2006 at 11:24:10AM -0500, David Golombek wrote:
> > We're beginning to suspect that a hung loopback NFS mount might be to
> > blame, although we can't reproduce this trivially. Is there any way in
> > which a mount that was behaving badly could affect the kernel in this
> > manner?
>
> Loopback NFS can deadlock in trying to free memory when writing back
> dirty pages. Use mount --bind instead.

Unfortunately, --bind is not an option for us. The custom nfs-server
is actually a protocol adapter, mapping a custom filesystem spread
across a cluster of machines into NFS. We have the loopback mount in
order to provide CIFS access via samba. Looking at
http://www.ussg.iu.edu/hypermail/linux/kernel/0407.3/0297.html

it certainly does seem like we're susceptible to this failure, so we
are now looking at memory usage at the time of the crash.
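
Because the box stops logging once it hangs, memory usage "at the time of the crash" in practice means the last sample written before it. A minimal Python sketch of such a sampler, assuming the "Key: value kB" lines of /proc/meminfo and an fsync after each record so the final sample has a chance of reaching disk. The names `parse_meminfo` and `append_sample` and the choice of fields are illustrative assumptions, not something taken from this thread.

```python
import os
import time


def parse_meminfo(text):
    """Parse 'Key:  value kB' lines; other lines are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def append_sample(log_path, fields, keys=("MemFree", "LowFree", "Cached"), now=None):
    """Append one timestamped record and force it out to disk."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    # Only report the fields this kernel actually provides.
    record = " ".join(f"{k}={fields[k]}kB" for k in keys if k in fields)
    with open(log_path, "a") as log:
        log.write(f"{stamp} {record}\n")
        log.flush()
        os.fsync(log.fileno())
```

A loop that calls `append_sample(path, parse_meminfo(open("/proc/meminfo").read()))` every few seconds would be enough for a first look. Since a local write can itself get stuck if writeback is what deadlocks, sending the same records to another host may turn out to be more reliable.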

Thanks,
Dave