2001-10-16 00:32:18

by Ryan Sweet

Subject: random reboots of diskless nodes - 2.4.7 (fwd)


I've posted about this problem before, but in the meantime I've managed to
test under several different configurations to help rule out some possible
causes.

Short version: 2.4.7 on nfsroot diskless nodes randomly re-boots and I
don't think it is a hardware problem or a problem with the server (which
is stable). Rather than "try this, try that..." I (and more importantly
my boss) would really like to find (and then hopefully fix) the root cause
of the problem.

See below for the long version and more details.

Questions:
- what the heck can I do to isolate the problem?
- why would the system re-boot instead of hanging on whatever caused it to
crash (ie, why don't I see an oops message?)
- how can I tell the system not to re-boot when it crashes (or is it
crashing at all??? see the aside after this list)
- is it worth trying all the newer kernel versions (this does not sound
very appealing, especially given the troubles reported with 2.4.10 and
also the split over which vm to use, etc..., also the changelogs don't
really point to anything that appears to precisely describe my problem)?
- if I patch with kgdb and use a null modem connection from the gateway to
run gdb can I expect to gain any useful info from a backtrace?

thanks for your help,
-Ryan Sweet,

- Long Version -

The problem:

At seemingly random intervals, one or more nodes of a diskless nfsroot
cluster will re-boot without being told to.... No information is available
on the console or in the logs. Sometimes the systems are up for weeks,
sometimes it is only a few hours. Out of 23 nodes, we can pretty
reliably get at least one failure every couple of days or less. As the
number of nodes decreases, our average time to failure increases.

Notably, the problem happens far less frequently (but still occurs
eventually) when running kernels without SMP support. Also, 2.4.7 seems
to offer increased stability over previous versions, but still fails.

There seems to be no degradation of performance or measurable increase in
resource consumption or other activity before the re-boot occurs. Memory
pressure does not seem to be a problem (of 768MB, we never use more than
256MB). lmsensors, using the on-board hardware sensors, does not record
any anomalies. In fact, performance on the system when it is not
re-booting is very good. A tcpdump shows that the last network activity
before the re-boot was an nfs file open, but since the systems are
nfsroot, that describes most of the network activity anyway.
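
For reference, the capture was nothing fancier than this (node name
hypothetical), filtered to the NFS port:

    tcpdump -n host node01 and port 2049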

--
The configuration:

Cluster Gateway: Single PIII-1Ghz 133Mhz FSB, 1GB RAM, dual 20GB IDE HD,
Mylex RAID Array, linux 2.4.6+sgi-xfs1.01, nfs-utils-0.3.1, Intel e1000
Gigabit adapter connected to Farallon FastStarlet 10/100/1000 switch,
3com3c90x connected to the rest of the network. /tftpboot/<cluster ips>
and /data (examples) are exported in /etc/exports to all the cluster
nodes. iptables is setup to masquerade all traffic (mainly DNS and YP
lookups) from the cluster to the rest of the network (a sketch follows
the kernel list below). The kernel is configured with the following
support (built-in, except as indicated):
sgi xfs 1.0.1
iptables+NAT
ext2
knfsdv3
DAC96x
3c90x
e1000 (module)
aic7xxx (module, used for tape device)
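
To illustrate the exports and masquerading mentioned above (addresses and
interface names here are made up):

    # /etc/exports -- one /tftpboot line per node
    /tftpboot/10.0.0.10   10.0.0.10(rw,no_root_squash)
    /data                 10.0.0.0/255.255.255.0(rw)

    # masquerade cluster traffic out the interface to the rest of the network
    iptables -t nat -A POSTROUTING -o eth1 -j MASQUERADE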

The root file system is on an ide disk using ext2, as is the tftpboot
area. The raid array is formatted with xfs and is mounted by each node
for use as data storage by the application.

The gateway node has been operating rock-solid stable since day one (about
90 days ago).

The compute nodes: Dual PIII-1GHz 133MHz FSB, ServerWorks chipset, 768MB
RAM (some with 1GB RAM). Some nodes are using the Asus CUR_DLS
motherboard, others are using a SuperMicro motherboard, both boards have
the ServerWorks chipset. All on-board features (apm, scsi, usb, ide...)
except video and the on-board eepro100 have been disabled. The systems
boot with nfsroot, mount several other partitions via nfs (for user dirs,
data area, etc...). They create a 64MB RAM disk and format it with ext2
to use for /tmp (otherwise LAM MPI and some other things break).
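
The /tmp setup is roughly this, assuming the ramdisk shows up as /dev/ram0:

    mke2fs -q -m 0 /dev/ram0
    mount /dev/ram0 /tmp
    chmod 1777 /tmp
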
The kernel is linux 2.4.7+lmsensors with support for the following (all
built-in):
smp (also tried UP, UP appears to be more stable, but still fails)
noapic
eepro100
knfsv3 client
nfsroot
ip auto-configuration (dhcp)
ramdisk, size 65536
initrd
ext2
lmsensors

The nodes mount the root as nfsv2 because if I mount as v3 they
inevitably fall over and _don't re-boot_. All other mounts are as v3.

--
The troubleshooting (abridged version; believe me, you don't want the
full story):
- profiled network, cpu, memory utilisation, hardware sensors,
with no anomalies visible
- tested all minor versions on the nodes from 2.2.19-2.4.7,
including the redhat and suse supplied kernels for 2.4.3-xx
- upgraded bios on all nodes to the latest recommended by the
manufacturer
- upgraded (replaced) the power supply in all nodes
- tested power source to computer room, moved to another computer
room with better available power, etc...
- tested with a different switch
- tested with all combinations of nfsv2/3 for each mount point
- replaced all cables, verified all connections (re-seated all CPUS,
RAM)
- swapped the gigabit adapter
- the problem originally occurred with a brand new batch of
CUR_DLS servers that were all identical, however we ordered a second batch
of supermicro servers that have a different motherboard, case, cooling,
etc... (but still use the same ServerWorks chipset) and the problem also
occurs with them
- tried playing with the nfs rsize and wsize (see the example after
this list). While this affected performance, it didn't seem to make a
difference in the failures.
- upgraded util-linux, mount, and nfs-utils to current versions
(as of August 2001)
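
The rsize/wsize example mentioned above: those went in via the nfsroot
kernel option, along these lines (server IP made up; %s expands to the
client IP):

    nfsroot=10.0.0.1:/tftpboot/%s,rsize=8192,wsize=8192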




--
Ryan Sweet <[email protected]>
Atos Origin Engineering Services
http://www.aoes.nl



2001-10-16 04:58:30

by Keith Owens

Subject: Re: random reboots of diskless nodes - 2.4.7 (fwd)

On Tue, 16 Oct 2001 02:28:46 +0200 (CEST),
Ryan Sweet <[email protected]> wrote:
>Questions:
>- what the heck can I do to isolate the problem?

Debugger over a serial console.
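
A typical setup is to put something like this on the node's kernel
command line (speed to taste), so console output and the debugger both
go out the serial port:

    console=ttyS0,38400n8 console=tty0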

>- why would the system re-boot instead of hanging on whatever caused it to
>crash (ie, why don't I see an oops message?)

Probably a triple fault on ix86, which forces a reboot. That is, a fault
was detected, trying to report it caused a second fault, and handling
that caused a third. Say goodnight, Dick. The other main possibility is
a hardware or software watchdog that thinks the system has hung and is
forcing a reboot; do you have one of those?

>- how can I tell the system not to re-boot when it crashes (or is it
>crashing at all???)

If it is a triple fault, you have to catch the error before the third
fault. Tricky.

>- is it worth trying all the newer kernel versions (this does not sound
>very appealing, especially given the troubles reported with 2.4.10 and
>also the split over which vm to use, etc..., also the changelogs don't
>really point to anything that appears to precisely describe my problem)?

Maybe. OTOH if you wait until you capture some diagnostics, you will
have a better indication of whether the later kernels actually fix the
problem.

>- if I patch with kgdb and use a null modem connection from the gateway to
>run gdb can I expect to gain any useful info from a backtrace?

It is definitely worth trying kgdb or kdb[1] over a serial console. I
am biased towards kdb (I maintain it) but either is worth a go.
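
For kgdb the gateway side is plain gdb against the node's vmlinux over
the null modem, roughly (baud rate depends on how you configured the
patch):

    gdb vmlinux
    (gdb) set remotebaud 38400
    (gdb) target remote /dev/ttyS0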

Unfortunately the most common triple fault is a kernel stack overflow
and the ix86 kernel design has no way to recover from that error, the
error handler needs stack space to report anything, both kgdb and kdb
need stack space as well. If you suspect stack overflow, look at the
IKD patch[2], it has code to warn about potential stack overflows
before they are completely out of hand.

[1] ftp://oss.sgi.com/projects/kdb/download/ix86, old for 2.4.7.
[2] ftp://ftp.kernel.org/pub/linux/kernel/people/andrea/ikd/

2001-10-16 08:08:56

by Alan

Subject: Re: random reboots of diskless nodes - 2.4.7 (fwd)

A serial console might help. When a box dies or reboots, it may log
something just before going down that you will see on a serial console
on a second box. But that's only a maybe.

Spontaneous reboots and hardware level freezes are about the worst things to
debug.

2001-10-16 20:05:57

by Hans-Peter Jansen

Subject: Re: random reboots of diskless nodes - 2.4.7 (fwd)

On Tuesday, 16. October 2001 02:28, Ryan Sweet wrote:

[..]

> The nodes mount the root as nfsv2 because if I mount as v3 they
> inevitably fall over and _don't re-boot_. All other mounts are as v3.

You may want to look into this one; it took me some time to figure out
how to mount root as v3 (append ",v3" to the --rootdir option of
mknbi-linux). Since then, I have seen far fewer problems here.
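
E.g. (paths made up, flags as in my mknbi version):

    mknbi-linux --rootdir=/tftpboot/node01,v3 bzImage > node01.nb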

FYI: I'm using 2.4.5 + Trond's patches on a wide range of diskless
workstations here (AMD K6/200, P2/350, P3/600, AMD K7/1200, 128-768MB,
all Asus + 3c90* + mga*). I'm writing this on a 1.2GHz box with pristine
2.4.12 and it feels fine, too (2d 3:51 up). A couple of the .9/.10-ac?
releases gave some sluggish vm behaviour, though. OTOH, I'm using
reiserfs on the server (2.4.7) with some patches, not xfs, but I will
give xfs a try on a dual K7 XP1700+ server test config by the end of
this week ;-)

In my experience it's important to put enough RAM into the boxes, since
I couldn't get them to swap via nbd. They usually run a full-featured
KDE2 environment, StarOffice 5.2, vmware, IOW all this heavy stuff with
heavily demanding users ;-)

Server1:# uptime
9:54pm up 75 days, 3:24, 1 user, load average: 0.00, 0.00, 0.00
Server1:# uname -a
Linux xyzzy 2.4.6 #20 SMP Mon Jul 23 22:24:14 CEST 2001 i686 unknown
HW: Dual Intel P3-900/ServerWorks OSB4/ICP RAID GDT 7x38RN/Yellowfin G-NIC II

Read you,
Hans-Peter

2001-10-19 15:32:40

by n0ano

Subject: Re: random reboots of diskless nodes - 2.4.7 (fwd)

My first guess would be power. You said you tested the power source.
Can you get hold of a power line monitor with a strip chart recorder?
The power might normally be fine but fluctuate at times and kick a
machine into reset.
I assume you've eliminated the possibility of the janitor who unplugs
a machine to find an outlet for his floor polisher (don't laugh, it's
happened).

How's the temperature on the machines? Even if it's OK it would be
good to get another strip chart recorder on it to make sure the temp
stays within bounds 24hrs/day.

Also, do you have a serial console attached to each machine? This is
the only reliable way to make sure you have every console message that
came out right before the reboot.
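
You don't have to watch them live, either; a dumb capture per serial
line is enough (device and speed here are just examples):

    stty -F /dev/ttyS0 9600 raw
    cat /dev/ttyS0 > node01.log &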

On Tue, Oct 16, 2001 at 02:28:46AM +0200, Ryan Sweet wrote:
>
> I've posted about this problem before, but in the meantime I've managed to
> test under several different configurations to help rule out some possible
> causes.
>
> Short version: 2.4.7 on nfsroot diskless nodes randomly re-boots and I
> don't think it is a hardware problem or a problem with the server (which
> is stable). Rather than "try this, try that..." I (and more importantly
> my boss) would really like to find (and then hopefully fix) the root cause
> of the problem.
>
>
>...
>
> - upgraded (replaced) the power supply in all nodes
> - tested power source to computer room, moved to another computer
> room with better available power, etc...

--
Don Dugger
"Censeo Toto nos in Kansa esse decisse." - D. Gale
[email protected]
Ph: 303/652-0870x117

2001-11-05 14:49:25

by Ryan Sweet

Subject: Re: random reboots of diskless nodes - 2.4.7 (fwd)


Keith,
Regarding the message below - I've now reproduced the problem with both
2.4.7 and 2.4.13 each with the appropriate kdb patch applied. The trouble
is that I never get a chance to break in or do anything else with the
debugger - the system just restarts without complaining. Would this be
the triple fault scenario described below?

As for IKD, I am trying again with 2.4.7 and IKD now. I am wondering,
though: will it do me any good if I don't catch the problem with my own
eyes as it happens? I have oodles of nodes and the problem happens on
one of them at random. If I run on one or two nodes it sometimes runs
for a week, so to increase my statistical sample (and to be closer to
the real usage) I have to test across a large subset of the cluster,
meaning that I can't watch 8-16 serial consoles at once.

thanks,
-Ryan Sweet

BTW - I tried using kdb for poking around at kernel internals on a
different system just for educational purposes and I wanted to say thanks
for such a great tool. It really helps to bridge the gap between the
source, gcc, as, and my generally useless lump of grey matter.

On Tue, 16 Oct 2001, Keith Owens wrote:

> On Tue, 16 Oct 2001 02:28:46 +0200 (CEST),
> Ryan Sweet <[email protected]> wrote:
> >Questions:
> >- what the heck can I do to isolate the problem?
>
> Debugger over a serial console.
>
> >- why would the system re-boot instead of hanging on whatever caused it to
> >crash (ie, why don't I see an oops message?)
>
> Probably a triple fault on ix86, which forces a reboot. That is, a fault
> was detected, trying to report it caused a second fault, and handling
> that caused a third. Say goodnight, Dick. The other main possibility is
> a hardware or software watchdog that thinks the system has hung and is
> forcing a reboot; do you have one of those?
>
> >- how can I tell the system not to re-boot when it crashes (or is it
> >crashing at all???)
>
> If it is a triple fault, you have to catch the error before the third
> fault. Tricky.
>
> >- is it worth trying all the newer kernel versions (this does not sound
> >very appealing, especially given the troubles reported with 2.4.10 and
> >also the split over which vm to use, etc..., also the changelogs don't
> >really point to anything that appears to precisely describe my problem)?
>
> Maybe. OTOH if you wait until you capture some diagnostics, you will
> have a better indication of whether the later kernels actually fix the
> problem.
>
> >- if I patch with kgdb and use a null modem connection from the gateway to
> >run gdb can I expect to gain any useful info from a backtrace?
>
> It is definitely worth trying kgdb or kdb[1] over a serial console. I
> am biased towards kdb (I maintain it) but either is worth a go.
>
> Unfortunately the most common triple fault is a kernel stack overflow
> and the ix86 kernel design has no way to recover from that error, the
> error handler needs stack space to report anything, both kgdb and kdb
> need stack space as well. If you suspect stack overflow, look at the
> IKD patch[2], it has code to warn about potential stack overflows
> before they are completely out of hand.
>
> [1] ftp://oss.sgi.com/projects/kdb/download/ix86, old for 2.4.7.
> [2] ftp://ftp.kernel.org/pub/linux/kernel/people/andrea/ikd/
>

--
Ryan Sweet <[email protected]>
Atos Origin Engineering Services
http://www.aoes.nl