From: Tim Connors
Subject: Re: why do i get "Stale NFS file handle" for hours?
Date: Sun, 5 Sep 2004 18:17:53 +1000
Sender: linux-kernel-owner@vger.kernel.org
Message-ID:
References: <1094348385.13791.119.camel@lade.trondhjem.org> <413A7119.2090709@upb.de> <1094349744.13791.128.camel@lade.trondhjem.org> <413A789C.9000501@upb.de> <1094353267.13791.156.camel@lade.trondhjem.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Cc: Sven Köhler, linux-kernel@vger.kernel.org, nfs@lists.sourceforge.net
Return-path:
To: Trond Myklebust
In-reply-to: <1094353267.13791.156.camel@lade.trondhjem.org>
List-ID:

Trond Myklebust said on Sat, 04 Sep 2004 23:01:07 -0400:

> On Sat, 04/09/2004 at 22:23, Sven Köhler wrote:
>
> > Sorry? Why is my server broken? I'm using kernel 2.6.8.1 with
> > nfs-utils 1.0.6 on my server, and I don't see what should be broken.
>
> When your server fails to work as per spec, then it is said to be
> "broken", no matter what kernel/nfs-utils combination you are using.
> The spec is that reboots are not supposed to clobber filehandles.
>
> So, there are 3 possibilities:
>
> 1) You are exporting an unsupported filesystem (e.g. FAT). See the
> FAQ on http://nfs.sourceforge.org.
> 2) A bug in your initscripts is causing the table of exports to be
> clobbered. Running "exportfs" in legacy 2.4 mode (without having the
> nfsd filesystem mounted on /proc/fs/nfsd) appears to be broken, for
> me at least...
> 3) There is some other bug in knfsd that nobody else appears to be
> seeing.

Have I got 2 cases of 3) for you, perhaps? I can't give you more info,
because I am not the admin of the boxes concerned, but we sometimes
lose the filehandles of specific files spontaneously (no server
reboots, nfsd restarts, etc.).

Background: we have a compute cluster of machines, all running SuSE's
2.4.20 or thereabouts. The NFS servers run Linus's 2.4.26, talking to
ext3, on bigass Apple Xserves.

I will update one directory with rsync from one host, and then try, a
little later on, to operate on that directory from another host. Every
now and then, from a single host only, a few files in that tree will
get stale filehandles: an ls of that directory will mostly be fine
apart from those files, and they will also be fine from any other
machine. I have found that if I clobber the cache with my
alloclargemem program, those files come back immediately.

The other problem we see regularly, and which is such a common
occurrence that I have explicitly coded a workaround into my scripts,
is when I start 120 jobs in a short time on 120 nodes. The jobs read a
bunch of common files read-only and then write their own private
files; a few of them will die with the read-only files being stale. It
looks as if the server just can't cope with a hundred requests (and
possibly mounts, since they are automounted) in the space of half a
minute (big files, mind you), and starts returning bogus data. The
entire mount (which is automounted, and looks like NFS version 3) will
then remain stale for eternity, with df returning its minus 3 bazillion
GB free, until automount is restarted.

Known problems? I googled for '"stale nfs file handle" spontaneous'
with no luck. Or is it likely that SuSE fscked with the NFS (and
autofs) client-side code? The sysadmins regard these failures as a
fact of life, but perhaps no-one else is seeing this, so it seems
worth reporting.
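For the curious, the alloclargemem cache-clobber trick is nothing
clever. I don't have the exact source to hand, so the sketch below is
illustrative rather than the real program (names and sizes are made
up): allocate more memory than the box has RAM and dirty every page,
so the kernel is forced to evict its cached NFS pages.

    /* Rough sketch of an alloclargemem-style cache clobberer.
     * Not the actual program: just allocate and dirty enough memory
     * that the kernel evicts cached NFS pages along the way.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
            /* number of 1MB chunks to touch; default 2048 (2GB) */
            long nchunks = (argc > 1) ? atol(argv[1]) : 2048;
            size_t chunk = 1 << 20;
            long i;

            for (i = 0; i < nchunks; i++) {
                    char *p = malloc(chunk);
                    if (!p)
                            break;  /* out of memory/address space */
                    memset(p, 1, chunk);    /* dirty every page */
                    /* deliberately leaked: we want maximum memory
                     * pressure, and we exit straight afterwards */
            }
            printf("touched %ld MB\n", i);
            return 0;
    }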
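And the workaround in my job scripts amounts to backing off and
retrying when a file comes up stale. The scripts themselves are shell;
the C below is an illustrative equivalent (the function name and retry
parameters are invented for this sketch), not the actual code:

    /* Illustrative retry-on-ESTALE open(), roughly equivalent to
     * what my job scripts do: back off and retry, giving the
     * client/automounter a chance to recover the mount.
     */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int open_retry_estale(const char *path, int flags, int tries)
    {
            int fd;

            while (tries-- > 0) {
                    fd = open(path, flags);
                    if (fd >= 0 || errno != ESTALE)
                            return fd;  /* success, or another error */
                    fprintf(stderr, "%s: ESTALE, retrying\n", path);
                    sleep(5);   /* back off before the next attempt */
            }
            errno = ESTALE;
            return -1;  /* still stale after all retries */
    }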
-- 
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
PUBLIC NOTICE AS REQUIRED BY LAW: Any Use of This Product, in Any Manner
Whatsoever, Will Increase the Amount of Disorder in the Universe. Although No
Liability Is Implied Herein, the Consumer Is Warned That This Process Will
Ultimately Lead to the Heat Death of the Universe.