Hi !
I'm running a compilation cluster: various machines, now all
on 2.4.1, mounting the same home filesystem over NFS from
the central NFS server.
All machines are on 2.4.1 using NFSv3, some SMP, some UP.
The NFS server is a dual-CPU machine running an SMP kernel.
The NFS clients are getting
"Stale NFS handle"
messages every once in a while which will make a "touch somefile.o"
fail.
It's quite annoying and I didn't see it on 2.2 even after the NFS
patches were integrated.
What happens is that one machine will finish compiling, and another
machine will immediately thereafter do a "touch some_output.o". This
"touch" sometimes fails with a stale handle message.
Is this a known problem, or should I submit more info? If so,
what info?
(no nothing is overclocked, yes I'm using RedHat 7 kgcc)
--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob Østergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:
> The NFS clients are getting
> "Stale NFS handle"
> messages every once in a while which will make a "touch somefile.o"
> fail.
If they have the previous .o handle cached and it was removed on another
client, that's quite reasonable behaviour. NFS isn't coherent.
> It's quite annoying and I didn't see it on 2.2 even after the NFS
> patches were integrated.
I wonder if it's because 2.4 runs faster and caches better 8). You can
tune the attribute cache times; that may help. Are we talking 30-second
intervals here, or stuff being cached for far too long (which would imply a bug)?
On Tue, Feb 13, 2001 at 11:31:50PM +0000, Alan Cox wrote:
> > The NFS clients are getting
> > "Stale NFS handle"
> > messages every once in a while which will make a "touch somefile.o"
> > fail.
>
> If they have the previous .o handle cached and it was removed on another
> client, that's quite reasonable behaviour. NFS isn't coherent.
So a solution would be to
<local node> rm -f output.o
<remote node> g++ .... -o output.o
<local node> touch output.o
I do the touch in the first place because a subsequent link job on
another remote node used to fail if I didn't. NFS caching magic I guess...
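For what it's worth, one blunt client-side mitigation is to retry the touch after a short pause, giving the client a chance to look the name up again and obtain a fresh filehandle. A minimal sketch, purely illustrative; whether a retry actually helps depends on the client dropping the stale dentry:

```shell
# retry_touch: retry a touch that may fail transiently (e.g. with a stale
# NFS handle), sleeping between attempts so the client can revalidate.
# Illustrative only; the retry count and delay are arbitrary.
retry_touch() {
    file=$1
    tries=${2:-3}
    i=0
    while [ "$i" -lt "$tries" ]; do
        touch "$file" 2>/dev/null && return 0
        sleep 1
        i=$((i + 1))
    done
    touch "$file"    # final attempt; let the error message through
}
```

Used in the job above it would replace the plain `touch somefile.o` step on the local node.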
>
> > It's quite annoying and I didn't see it on 2.2 even after the NFS
> > patches were integrated.
>
> I wonder if it's because 2.4 runs faster and caches better 8). You can
> tune the attribute cache times; that may help. Are we talking 30-second
> intervals here, or stuff being cached for far too long (which would imply a bug)?
It runs faster indeed, and it makes the work more fun and makes the
internet go faster ! (uhhh... or maybe the internet speedup is because
of the Intel CPUs - I forgot...)
Usually how a compile goes is:
a make -j10 spawns concurrent compile jobs. Each job consists of
"spawn on remote node" g++ ... -o somefile.o
"on local node" touch somefile.o
After a truckload of .o files have been generated, it will start
up link jobs on other remote nodes. I haven't tried this without
the touch trick for a long time.
Each g++ job takes from a few seconds to several minutes depending
on the file and optimization options etc. I think I mainly see this
with the fast jobs... Most .o files are ~1-4 MB and I have about 200
of them.
~200 compiles and ~12 links take about 4-5 minutes to complete on
the cluster of three dual machines and two or three single-CPU boxes,
producing about 1.5 GB of object code in total (yes, C++ symbols are HUGE).
So, the touch is _immediately_ after a compile completion. But the
.o file has not been in use on any other machine than the one running
the compiler for hours or at least many minutes (a different compile).
I'll try this without the touch trick and see how things fare...
>>>>> " " == Jakob Østergaard writes:
> What happens is that one machine will finish compiling, and
> another machine will immediately thereafter do a "touch
> some_output.o". This "touch" sometimes fails with a stale
> handle message.
Does the appended patch change anything?
Cheers,
Trond
--- linux-2.4.1/fs/nfs/inode.c.orig Tue Dec 12 02:46:04 2000
+++ linux-2.4.1/fs/nfs/inode.c Wed Feb 14 01:00:33 2001
@@ -100,6 +100,7 @@
 	inode->i_rdev = 0;
 	NFS_FILEID(inode) = 0;
 	NFS_FSID(inode) = 0;
+	NFS_FLAGS(inode) = 0;
 	INIT_LIST_HEAD(&inode->u.nfs_i.read);
 	INIT_LIST_HEAD(&inode->u.nfs_i.dirty);
 	INIT_LIST_HEAD(&inode->u.nfs_i.commit);
Alan Cox wrote:
> > The NFS clients are getting
> > "Stale NFS handle"
> > messages every once in a while which will make a "touch somefile.o"
> > fail.
>
> If they have the previous .o handle cached and it was removed on another
> client, that's quite reasonable behaviour. NFS isn't coherent.
As reported before, I see similar stuff with a 2.2 SMP NFS client and
a 2.2 NFS server.
Roger.
--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* There are old pilots, and there are bold pilots.
* There are also old, bald pilots.
I have my home directory mounted on one computer from another. I
rebooted the server, and now the client says "Stale NFS file handle"
any time something tries to read my home directory. It has been this
way for about a day. Shouldn't any caches have expired by now?
Both server and client are running 2.4.2.
I've tried `mount /home -o remount`, and reading lots of other
directories to flush that entry out of the cache, without any
results.
I was hoping to avoid unmounting, as I would have to shut just about
everything down to do that.
--
+---------------------------------+
| David Fries |
| [email protected] |
+---------------------------------+
On Sun, Feb 25, 2001 at 04:43:46PM +1100, Neil Brown wrote:
> On Saturday February 24, [email protected] wrote:
> > I have my home directory mounted on one computer from another. I
> > rebooted the server and now the client is saying Stale NFS file handle
> > anytime something goes to read my home directory. It has been this
> > way for about a day. Shouldn't any caches expire by now?
>
> It isn't that a cache needs to expire. It sounds like it is a cache
> that needs to be filled.
>
> The kernel keeps a cache of ip addresses that are allowed access to
> particular filesystems. This is visible through /proc/fs/nfs/exports.
> It is filled at reboot by "exportfs -a" or "exportfs -r" which gets
> information from /etc/exports and /var/lib/nfs/rmtab.
>
> So check that /etc/exports contains the right info.
> Check that /var/lib/nfs/rmtab lists the filesystems and clients that
> you expect to have access, and then run "exportfs -av"
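Neil's checks can be scripted. A hedged sketch that reports clients recorded in rmtab but absent from the kernel's live export cache; the paths default to the 2.4-era nfs-utils locations, the rmtab lines are assumed to start with "host:", and the string match is deliberately crude:

```shell
# check_exports: list clients that rmtab says have something mounted but
# that do not appear in the kernel's export cache. File paths are
# overridable (useful for testing without a live NFS server).
check_exports() {
    rmtab=${1:-/var/lib/nfs/rmtab}
    exports=${2:-/proc/fs/nfs/exports}
    cut -d: -f1 "$rmtab" | sort -u | while read -r client; do
        grep -q "$client" "$exports" || \
            echo "missing from export cache: $client"
    done
}
```

Any client it prints is one the kernel will refuse, which typically shows up on that client as a stale handle or permission error; "exportfs -av" should repopulate the cache.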
checked, verified, re-exported, still Stale NFS file handle on client.
I also ran tcpdump on the server, and when I do an ls on my home
directory (this is where I see the Stale NFS message), it does not
generate any network traffic. It can't be the server's fault if the
client isn't even asking it.
> > Both server and client are running 2.4.2.
> >
> > I've tried `mount /home -o remount`, and reading lots of other
> > directories to flush that entry out of the cache, without any
> > results.
> >
> > I was hoping to avoid unmounting, as I would have to shut just
> > about everything down to do that.
>>>>> " " == David Fries <[email protected]> writes:
> I've tried `mount /home -o remount`, and reading lots of other
> directories to flush that entry out of the cache, without
> any results.
> I was hoping to avoid unmounting, as I would have to shut
> just about everything down to do that.
It looks as if you'll have to do that. 'mount -oremount' does not
really cause the root filehandle to get updated. The only thing it
does at the moment is allow you to change from a read-only to a
read-write filesystem.
What kind of filesystem is this BTW: is it an ext2 partition you are
exporting?
Cheers,
Trond
On Sun, Feb 25, 2001 at 08:25:10PM +1100, Neil Brown wrote:
> On Saturday February 24, [email protected] wrote:
> Verrry odd. I can see why you were suspecting a cache.
> I'm probably going to have to palm this off to Trond, the NFS client
> maintainer (are you listening, Trond?), but could you please confirm
> that from the client you can:
>
> 1/ ping server
> 2/ rpcinfo -p server
> 3/ showmount -e server
> 4/ mount server:/exported/filesys /some/other/mount/point
>
> If all of these work, then I am mystified. If one of these fails,
> then that might point the way to further investigation.
I have server:/home mounted on /home; the directory /home/david is the
one file/directory on that mount that has a stale handle. Everything
else on that mount point works, including accessing any file under
/home/david.
I mounted it on a different directory and the new mount was fine, and
the problem directory on the new mount was fine, but the problem
directory on the old mount was still stale.
Yes, it is an ext2 filesystem being exported, using the kernel
NFS server.
On 25 Feb 2001, Trond Myklebust wrote:
> > I was hoping to avoid unmounting, as I would have to shut
> > just about everything down to do that.
>
> It looks as if you'll have to do that. 'mount -oremount' does not
> really cause the root filehandle to get updated. The only thing it
> does at the moment is allow you to change from a read-only to a
> read-write filesystem.
A trick that works for me is mounting the NFS filesystem on another mount
point and unmounting it there. This usually makes the mount on the
original mount point magically work again.
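Spelled out, the trick amounts to something like this (the server and mount-point names are placeholders for your own setup, and it needs root):

```
mkdir -p /mnt/nfs-scratch
mount server:/home /mnt/nfs-scratch    # fresh mount, fresh root filehandle
umount /mnt/nfs-scratch
ls /home                               # with luck, no longer stale
```

The fresh mount forces a new lookup of the root filehandle, which is presumably why it sometimes revives the original mount.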
cheers,
Lennert
On Mon, Feb 26, 2001 at 10:54:02AM +0100, Lennert Buytenhek wrote:
> A trick that works for me is mounting the NFS filesystem on another mount
> point and unmounting it there. This usually makes the mount on the
> original mount point magically work again.
Thanks, but I've already tried that and it didn't work.
>>>>> " " == Neil Brown <[email protected]> writes:
> So... you can access things under /home/david, but you cannot
> access /home/david itself? So, supposing that "fred" were some
> file that you happen to know is in /home/david, then
> ls /home/david fails with ESTALE and does not cause
> any traffic to the server and
This is normal. Once an inode gets flagged as stale, it remains
stale. After all, it would also be a bug if a filehandle were stale
one moment and then not the next.
The question here is therefore really why did the server tell us that
the filehandle was stale in the first place.
Cheers,
Trond