2004-09-10 00:50:27

by John Nitis

Subject: RHEL3 Update 1 and NetApp NFS freeze issues

Greetings,

This is a bit of a stab in the dark but I thought this might be a good forum
to ask for input as it has both Linux NFS experts and a NetApp expert who
are regular contributors. We are unsure what the cause of the problem is
(even whether it's hardware or software) but one of our focuses is NFS
hangs.

Our problem is this: we have an entire rack of machines (57 of them) that
lock up on a very regular basis. They respond to ping but do not respond to
telnet, ssh, etc. When you plug in a VGA monitor and PS/2 keyboard the
screen comes up but you can't log in; when you hit Enter it just echoes a
linefeed back on the screen. A small percentage of them kernel panic.
They are all running RHEL3 WS with Update 1 installed and they're talking to
NetApp Filers with NFS via TCP. (The same thing happened with NFS via UDP;
it is just *somewhat* less frequent with NFS via TCP.)

We have "top" and "ps augxww" output logging to a file once per minute and
some of them show excessive load averages before they freeze with many
processes stuck in D (uninterruptible sleep or disk wait). If you catch
these before the load average gets too high you can tell that a mount has
locked up (df hangs after displaying a few mounts and you can't access the
mount that's locked up). Each new process that gets stuck adds 1 to the
load average. The machine locks up in exactly the same way when we yank the
Ethernet cable from the box.
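
A rough sketch of how the once-per-minute "ps" log described above could be scanned for processes stuck in D state. It assumes standard "ps aux"-style columns with STAT in the eighth field; the helper name d_state_processes and the sample lines are made up for illustration and are not taken from the affected machines.

```python
def d_state_processes(ps_output):
    """Return (pid, command) pairs for processes whose STAT starts with 'D'.

    Assumes `ps aux`-style columns:
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    """
    stuck = []
    for line in ps_output.splitlines():
        fields = line.split(None, 10)  # keep COMMAND (with spaces) as one field
        if len(fields) < 11 or fields[0] == "USER":
            continue  # skip the header and malformed lines
        pid, stat, command = fields[1], fields[7], fields[10]
        if stat.startswith("D"):
            stuck.append((int(pid), command))
    return stuck


# Hypothetical sample, invented for this example.
SAMPLE_PS = """\
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  1520  516 ?        S    Sep09   0:04 init [3]
jdoe      4211  0.0  0.1  5032 1180 ?        D    00:41   0:00 df -k
jdoe      4250  0.0  0.1  4988 1102 pts/2    D    00:42   0:00 ls /tools/eda
jdoe      4302  1.2  0.4 11840 4400 pts/3    R    00:43   0:01 ps augxww
"""

if __name__ == "__main__":
    for pid, command in d_state_processes(SAMPLE_PS):
        print(pid, command)
```

Comparing the count from one snapshot to the next would show whether the number of blocked processes is climbing along with the load average.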

Before we switched to NFS via TCP, the "retrans" counter shown by nfsstat
was an extremely high percentage of total RPC calls. Our NetApps are
extremely busy, with hundreds of Linux clients pounding them with
reads/writes. These machines are used in an EDA/engineering environment
where hundreds of Linux clients access a set of NetApp Filers via
Ethernet. In this case the machines which crash are Opteron boxes with
Broadcom NICs using the tg3 driver. We have RHEL3 Intel Xeon machines, as
well as hundreds of RH7.2/7.3 clients, which don't exhibit this problem.
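
To put a number on "an extremely high percentage", the retransmission ratio can be computed from the client RPC counters. The sketch below assumes the "calls retrans authrefrsh" header layout that "nfsstat -rc" prints; the helper name retrans_percent and the counter values are invented for the example.

```python
def retrans_percent(nfsstat_output):
    """Return retransmissions as a percentage of RPC calls, or None if not found."""
    rows = [line.split() for line in nfsstat_output.splitlines()]
    # Look for the header row, then read the numbers on the row right after it.
    for header, values in zip(rows, rows[1:]):
        if header[:2] == ["calls", "retrans"] and len(values) >= 2:
            calls, retrans = int(values[0]), int(values[1])
            return 100.0 * retrans / calls if calls else 0.0
    return None


# Hypothetical counters, not measured on any real client.
SAMPLE_NFSSTAT = """\
Client rpc stats:
calls      retrans    authrefrsh
2000000    150000     0
"""

if __name__ == "__main__":
    print("retrans: %.1f%% of calls" % retrans_percent(SAMPLE_NFSSTAT))
```

Sampling this on an interval and looking at the difference between samples is more telling than the absolute totals, since the counters accumulate from boot.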

I've searched the list and seen much talk of NFS patches and whatnot. Would
any of these be helpful to us?

Does anyone have any ideas as to what might be the problem or how we might
go about debugging it further? I've recently set the debugging levels to
"10" in /proc/sys/sunrpc/rpc_debug and /proc/sys/sunrpc/nfs_debug to see if
that will garner some information. A few details follow below.

Thanks very much,

John


OS version:

Red Hat Enterprise Linux WS release 3 (Taroon Update 1)

Kernel version:

Linux lscsbr3-1 2.4.21-9.ELsmp #1 SMP Thu Jan 8 16:52:31 EST 2004 x86_64
x86_64 x86_64 GNU/Linux

DataONTAP version:

6.4.1 and one filer running 6.5.1R1



-------------------------------------------------------
This SF.Net email is sponsored by: YOU BE THE JUDGE. Be one of 170
Project Admins to receive an Apple iPod Mini FREE for your judgement on
who ports your project to Linux PPC the best. Sponsored by IBM.
Deadline: Sept. 13. Go here: http://sf.net/ppc_contest.php
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2004-09-10 01:52:17

by Trond Myklebust

Subject: Re: RHEL3 Update 1 and NetApp NFS freeze issues

On Thu, 09/09/2004 at 20:48, John Nitis wrote:

> Our problem is this, essentially we have an entire rack of machines (57 of
> them) that lock up on a very regular basis. They respond to ping but do not
> respond to telnet, ssh, etc. When you plug in a VGA monitor/PS2 keyboard
> the screen pops up but you can't login. When you hit enter it just echoes
> back a linefeed on the screen. A small percentage of them kernel panic.
> They are all running RHEL3 WS with Update 1 installed and they're talking to
> NetApp Filers with NFS via TCP. (Same thing happened with NFS via UDP,
> however it's *somewhat* less frequent with NFS via TCP).
>
> We have "top" and "ps augxww" output logging to a file once per minute and
> some of them show excessive load averages before they freeze with many
> processes stuck in D (uninterruptible sleep or disk wait). If you catch
> these before the load average gets too high you can tell that a mount has
> locked up (df hangs after displaying a few mounts and you can't access the
> mount that's locked up). Each new process that gets stuck adds 1 to the
> load average. The machine locks up in exactly the same way when we yank the
> Ethernet cable from the box.
>
> Previous to switching to NFS via TCP the "retrans" counter shown via nfsstat
> was an extremely high percentage of total rpc calls. Our NetApps are
> extremely busy with hundreds of Linux clients pounding them with
> reads/writes. These machines are used in an EDA/engineering environment
> where we have hundreds of Linux clients accessing a set of NetApp Filers via
> Ethernet. In this case the machines which crash are Opteron boxes with
> Broadcom NICs using the tg3 driver. We have RHEL3 Intel Xeon machines which
> don't exhibit this problem as well as hundreds of RH7.2/7.3 clients which
> don't exhibit the problem.

Hmm... Normally, a TCP mount should only retransmit very sporadically,
since the default value of timeo should be ~60 seconds. Are you perhaps
setting some value that is lower than the default?

Note that early versions of "amd" had a problem with the default value
too: they would set timeo=0.7 seconds irrespective of whether the client
was using UDP or TCP.
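
One way to check for this is to look at the options on each NFS mount. The sketch below is only an illustration: it assumes /proc/mounts-style lines in which the transport and timeo show up as options (older kernels may not list timeo at all, in which case nothing is flagged), and it relies on the mount option being expressed in tenths of a second, so the ~60 second default corresponds to timeo=600. The helper name and the sample mounts are made up.

```python
TCP_DEFAULT_TIMEO = 600  # tenths of a second, i.e. the ~60 seconds mentioned above


def low_timeo_tcp_mounts(mounts_text):
    """Return (mount_point, timeo_seconds) for NFS/TCP mounts with timeo below the default."""
    flagged = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "nfs":
            continue  # not an NFS mount line
        options = fields[3].split(",")
        is_tcp = "tcp" in options or "proto=tcp" in options
        timeo = None
        for opt in options:
            if opt.startswith("timeo="):
                timeo = int(opt.split("=", 1)[1])
        # A missing timeo option is treated as "default in effect".
        if is_tcp and timeo is not None and timeo < TCP_DEFAULT_TIMEO:
            flagged.append((fields[1], timeo / 10.0))
    return flagged


# Hypothetical mount table, invented for this example.
SAMPLE_MOUNTS = """\
filer1:/vol/tools /tools nfs rw,v3,hard,tcp,timeo=600,addr=filer1 0 0
filer2:/vol/home /home nfs rw,v3,hard,tcp,timeo=7,addr=filer2 0 0
filer3:/vol/scratch /scratch nfs rw,v3,hard,udp,timeo=7,addr=filer3 0 0
/dev/sda1 / ext3 rw 0 0
"""

if __name__ == "__main__":
    for mount_point, seconds in low_timeo_tcp_mounts(SAMPLE_MOUNTS):
        print("%s: timeo is %.1f s over TCP" % (mount_point, seconds))
```

A TCP mount showing timeo=7 here would be the 0.7 second case described above; the same value on a UDP mount is not flagged.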

Cheers,
Trond




2004-09-10 11:15:52

by Steve Dickson

Subject: Re: RHEL3 Update 1 and NetApp NFS freeze issues

John Nitis wrote:

>Greetings,
>
>This is a bit of a stab in the dark but I thought this might be a good forum
>to ask for input as it has both Linux NFS experts and a NetApp expert who
>are regular contributors. We are unsure what the cause of the problem is
>(even whether it's hardware or software) but one of our focuses is NFS
>hangs.
>
>Our problem is this, essentially we have an entire rack of machines (57 of
>them) that lock up on a very regular basis. They respond to ping but do not
>respond to telnet, ssh, etc. When you plug in a VGA monitor/PS2 keyboard
>the screen pops up but you can't login. When you hit enter it just echoes
>back a linefeed on the screen. A small percentage of them kernel panic
>
Set up netdump so that when an oops occurs, a system image (or core) will be
created. Then use the crash utility to examine the core. This will give you
a wealth of information on what is going on in the system.
(Note: you'll have to install the correct kernel-debuginfo package for this
to work.)

When the system just hangs, make sure the Alt-SysRq keys are enabled
(by doing an "echo 1 > /proc/sys/kernel/sysrq"). Then use:
Alt-SysRq-P to see where the current process is executing
Alt-SysRq-T to get a stack trace of every task
Alt-SysRq-M to get memory information
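
Since the keys do nothing unless SysRq is switched on beforehand, it may be worth checking the setting across the rack before the next hang. A minimal sketch, assuming only that /proc/sys/kernel/sysrq holds an integer; the helper name and the key summary table are illustrative.

```python
import os

SYSRQ_PATH = "/proc/sys/kernel/sysrq"

# The three keys named above and, roughly, what each asks the kernel to print.
SYSRQ_KEYS = {
    "p": "registers of the current process",
    "t": "task list with stack traces",
    "m": "memory information",
}


def sysrq_enabled(path=SYSRQ_PATH):
    """Return True if the file at `path` holds a non-zero integer.

    On newer kernels the value is a bitmask, so non-zero may mean that
    only some SysRq functions are allowed.
    """
    with open(path) as f:
        return int(f.read().split()[0]) != 0


if __name__ == "__main__":
    if os.path.exists(SYSRQ_PATH):
        print("SysRq is", "enabled" if sysrq_enabled() else "disabled")
        for key, meaning in sorted(SYSRQ_KEYS.items()):
            print("Alt-SysRq-%s: %s" % (key.upper(), meaning))
```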

>We have "top" and "ps augxww" output logging to a file once per minute and
>some of them show excessive load averages before they freeze with many
>processes stuck in D (uninterruptible sleep or disk wait). If you catch
>these before the load average gets too high you can tell that a mount has
>locked up (df hangs after displaying a few mounts and you can't access the
>mount that's locked up). Each new process that gets stuck adds 1 to the
>load average. The machine locks up in exactly the same way when we yank the
>Ethernet cable from the box.
>
>
Before things go south, does ifconfig ethX show any interface errors?
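
If the ifconfig output is already being logged alongside top and ps, the error counters can be pulled out with something like the sketch below. It assumes the classic net-tools layout ("RX packets:N errors:N dropped:N overruns:N ..."); the helper name, the address and the counter values are invented for the example.

```python
import re

# Matches the RX/TX counter lines of classic net-tools ifconfig output.
_COUNTERS = re.compile(
    r"^\s*(RX|TX) packets:(\d+) errors:(\d+) dropped:(\d+) overruns:(\d+)"
)


def interface_errors(ifconfig_output):
    """Return {"RX": {...}, "TX": {...}} with packets, errors, dropped, overruns."""
    stats = {}
    for line in ifconfig_output.splitlines():
        match = _COUNTERS.match(line)
        if match:
            direction = match.group(1)
            packets, errors, dropped, overruns = (int(g) for g in match.groups()[1:])
            stats[direction] = {
                "packets": packets,
                "errors": errors,
                "dropped": dropped,
                "overruns": overruns,
            }
    return stats


# Hypothetical output for one interface, not captured from a real machine.
SAMPLE_IFCONFIG = """\
eth0      Link encap:Ethernet  HWaddr 00:00:00:00:00:01
          inet addr:192.0.2.10  Bcast:192.0.2.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1200345 errors:12 dropped:3 overruns:0 frame:12
          TX packets:980112 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
"""

if __name__ == "__main__":
    for direction, counters in sorted(interface_errors(SAMPLE_IFCONFIG).items()):
        print(direction, counters)
```

Counters that grow between two samples taken shortly before a hang would point at the link rather than at NFS.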

>Does anyone have any ideas as to what might be the problem or how we might
>go about debugging it further? I've recently set the debugging levels to
>"10" in /proc/sys/sunrpc/rpc_debug and /proc/sys/sunrpc/nfs_debug to see if
>that will garner some information. A few details follow below.
>
>
>
If you're using autofs/amd, turn it off (if you can) to see what happens.

I hope this helps....

SteveD.




2004-09-10 14:30:48

by Lever, Charles

Subject: RE: RHEL3 Update 1 and NetApp NFS freeze issues

> Previous to switching to NFS via TCP the "retrans" counter
> shown via nfsstat was an extremely high percentage of total
> rpc calls.

so now that you are mounting using TCP, the retrans counter remains
flat?

with UDP, this could be either: the network environment is causing
packet loss, or the servers are often responding more slowly than the
estimated RPC RTT (which is typical for loaded filers).

both cases are helped by switching to TCP. i assume that's why you
switched. can you describe your network environment to us so we get a
fuller picture? link speeds, duplex setting, switch or router models,
etc.

> In this case the machines which crash
> are Opteron boxes with Broadcom NICs using the tg3 driver.
> We have RHEL3 Intel Xeon machines which don't exhibit this
> problem as well as hundreds of RH7.2/7.3 clients which don't
> exhibit the problem.

if only the Opteron/tg3 systems exhibit this problem, then i would
suggest hunting for a problem with the tg3 driver or for compatibility
issues between your NICs and your switch. are the Opteron systems SMP?

i second steve's recommendation to enable SysRq and capture a thread
traceback. there's already a line in /etc/sysctl.conf that you can use
to enable it automatically after every reboot.

# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 1

is what mine looks like.

> I've searched the list and seen much talk of NFS patches and
> whatnot, would any of these be helpful to us?

so far it sounds like a networking problem, not an NFS problem.



2004-09-10 14:45:09

by Stuckless, Colin

Subject: RE: RHEL3 Update 1 and NetApp NFS freeze issues

> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]]
> Sent: Thursday, September 09, 2004 10:18 PM
> To: '[email protected]'
> Subject: [NFS] RHEL3 Update 1 and NetApp NFS freeze issues
>
>
> Greetings,
>
> This is a bit of a stab in the dark but I thought this might
> be a good forum to ask for input as it has both Linux NFS
> experts and a NetApp expert who are regular contributors. We
> are unsure what the cause of the problem is (even whether
> it's hardware or software) but one of our focuses is NFS hangs.
> ...
> Previous to switching to NFS via TCP the "retrans" counter
> shown via nfsstat was an extremely high percentage of total
> rpc calls. Our NetApps are extremely busy with hundreds of
> Linux clients pounding them with reads/writes. These
> machines are used in an EDA/engineering environment where we
> have hundreds of Linux clients accessing a set of NetApp
> Filers via Ethernet. In this case the machines which crash
> are Opteron boxes with Broadcom NICs using the tg3 driver.
> We have RHEL3 Intel Xeon machines which don't exhibit this
> problem as well as hundreds of RH7.2/7.3 clients which don't
> exhibit the problem.

What's the latest status of the tg3 driver vs. the bcm5700 driver for the
Broadcom? I know HP supplies the bcm5700 with their workstations/servers
that have Broadcom chipsets, and I've never had a problem with them here.

The tg3 driver is somewhat annoying in that it doesn't support parameters
via modules.conf to force certain settings (ethtool or mii-tool being the
preferred method I guess). I know I have seen complaints against the
stability of tg3 drivers in the past. This one from November 2002 is dated
now but might be worth looking at:
http://lists.us.dell.com/pipermail/linux-poweredge/2002-November/010486.html
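
Since ethtool is the way to inspect (and force) link settings with tg3, here is a small sketch that pulls the negotiated speed and duplex out of its report so the values can be compared across the rack. It assumes the usual "Key: value" layout of "ethtool ethX" output; the helper name and the sample report are made up.

```python
def link_settings(ethtool_output):
    """Return the speed, duplex, autonegotiation and link fields from ethtool output."""
    wanted = ("Speed", "Duplex", "Auto-negotiation", "Link detected")
    settings = {}
    for line in ethtool_output.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key in wanted:
            settings[key] = value.strip()
    return settings


# Hypothetical report, not taken from one of the Opteron boxes.
SAMPLE_ETHTOOL = """\
Settings for eth0:
    Supported ports: [ TP ]
    Advertised auto-negotiation: Yes
    Speed: 100Mb/s
    Duplex: Half
    Port: Twisted Pair
    Auto-negotiation: on
    Link detected: yes
"""

if __name__ == "__main__":
    settings = link_settings(SAMPLE_ETHTOOL)
    print(settings)
    if settings.get("Duplex") == "Half":
        print("half duplex on a switched link is worth a second look")
```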


-cjs

