Hi everybody -- I'm really hoping someone can push me in the right
direction on this . . .
NFS server is a NetApp filer running OnTap 6.3.2
NFS clients are RedHat 9 boxes running linux 2.4.20.
We are using soft mounts with these options:
rw,soft,intr,nfsvers=3,wsize=32768,rsize=32768, \
proto=tcp,timeo=3,retrans=1,noac,sync
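For reference, timeo is given in tenths of a second. The sketch below shows how timeo=3 with retrans=1 adds up, assuming the client doubles the timeout after each retransmission and caps it at 60 seconds. That backoff rule is an assumption, and the function name is made up for illustration; behaviour over TCP may differ from this.

```python
def total_timeout_seconds(timeo, retrans, max_timeout_ds=600):
    """Sum the wait for the first request plus `retrans` retransmissions.

    timeo and max_timeout_ds are in tenths of a second (deciseconds).
    Assumes the timeout doubles after each retransmission.
    """
    wait_ds = timeo
    total_ds = 0
    for _ in range(retrans + 1):
        total_ds += wait_ds
        wait_ds = min(wait_ds * 2, max_timeout_ds)
    return total_ds / 10.0


# timeo=3, retrans=1: 0.3 s for the first try plus 0.6 s for the retry
print(total_timeout_seconds(timeo=3, retrans=1))
```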
We are using these mount options so that our application
(which is constantly writing to the NFS server) can detect
an NFS operation timeout after 0.9 seconds and fail over to local
disk to queue until the NFS server comes back.
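A minimal sketch of that fail-over pattern, not the actual application code: it assumes a timed-out operation on a soft mount surfaces to the caller as an OSError (typically EIO). The function name and all paths are hypothetical. It only covers the case where the NFS operation fails; it does nothing for an operation that never returns.

```python
import os
import tempfile


def write_with_failover(data, nfs_path, local_queue_path):
    """Append data to nfs_path, or to local_queue_path if that fails.

    Returns "nfs" or "local" so the caller knows where the record went.
    """
    try:
        with open(nfs_path, "ab") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # force the write out so errors surface here
        return "nfs"
    except OSError:
        # Assumption: a soft-mount timeout shows up as an OSError (usually
        # EIO). Any OSError is treated as "server unavailable".
        with open(local_queue_path, "ab") as q:
            q.write(data)
        return "local"


# Demo with hypothetical paths: a missing directory stands in for a dead mount.
workdir = tempfile.mkdtemp()
dead_mount = os.path.join(workdir, "no-such-mount", "app.log")
local_queue = os.path.join(workdir, "queue.log")
print(write_with_failover(b"record 1\n", dead_mount, local_queue))
```

If a write partly reaches the server before the error, the same record can end up in both places, so whatever replays the local queue would need to tolerate duplicates.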
The problem is that if the NFS server is gone for any length of time,
the mount point doesn't "recover". df -k hangs forever, I can't
re-mount the mount point, and any processes that attempt to stat
or otherwise access the mount point are shown as being in an
"uninterruptible sleep" according to ps.
The only way I've been able to restore access to our mount point is to
reboot the clients that are hung. After enabling nfs/rpc debugging I'm
seeing this in /var/log/messages after attempting an NFS operation:
Jul 17 14:40:04 ts_10_2_20_22 kernel: RPC: 47394 reserved req f7aca7c8 xid 6e94a90f
Jul 17 14:40:04 ts_10_2_20_22 kernel: RPC: 47394 xprt_reserve returns 0
Jul 17 14:40:04 ts_10_2_20_22 kernel: RPC: 47395 xprt_reconnect f7aca000 connected 0
Jul 17 14:40:04 ts_10_2_20_22 kernel: RPC: 47395 TCP write queue full
According to ltrace and strace, processes hang in one of the following
locations:
__xstat64(3, "/corelog", 0x08058f74 <unfinished ...>
stat64("/corelog", <unfinished ...>
statfs("/corelog", <unfinished ...>
Restarting nfslock and portmap has no effect. Note that I can mount the
same NFS share to a different location on the client and work from there
-- but in order to restore access to the original mount point I have to
reboot the server...
any ideas?
-------------------------------------------------------
This SF.net email is sponsored by: VM Ware
With VMware you can run multiple operating systems on a single machine.
WITHOUT REBOOTING! Mix Linux / Windows / Novell virtual machines at the
same time. Free trial click here: http://www.vmware.com/wl/offer/345/0
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs
hi antonio-
replying directly to [email protected] bounced,
so i'm sending to the list instead.
using a short timeout with TCP is probably a
bad idea, though it's doubtful that is the
root cause of your problem.
have you tried the same mount options but using
UDP instead?
can you send us a network trace? raw tcpdump
with snaplen of 1536 is preferred.
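One way to produce such a trace, sketched as a small helper that builds the tcpdump command line. It uses -s for the snaplen and -w to write raw packets to a file. The host name, file name, and function name are placeholders, not taken from this thread.

```python
def tcpdump_command(server, outfile, snaplen=1536, interface=None):
    """Build a tcpdump argv that writes raw packets to outfile.

    -s sets the snaplen, -w writes the raw capture to a file, and the
    trailing "host <server>" filter limits the capture to the filer.
    """
    argv = ["tcpdump", "-s", str(snaplen), "-w", outfile]
    if interface:
        argv += ["-i", interface]
    argv += ["host", server]
    return argv


# Placeholder names: substitute the real filer address and output file.
print(" ".join(tcpdump_command("filer.example.com", "nfs-hang.pcap")))
```

The capture usually needs root on the client and should be started before reproducing the hang, so that the reconnect attempts are in the trace.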
> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> Sent: Thursday, July 17, 2003 5:48 PM
> To: [email protected]
> Subject: [NFS] mount point not recovering after NFS server comes back