Hi,
I'm trying to track down problems with NFS exports of a large XFS
partition that only became apparent when the server went into production
(of course they didn't show up in 6 months of stability testing!).
Unfortunately, I don't know if the problem occurs on other filesystems, as
I don't currently have an ext2/3 partition available that I can put the
same random load onto. We were running 2.2.18 + NFS patches over ext2
until we migrated to this server last Friday - so a lot has changed. It
looks like it may be similar to what is reported in the
http://marc.theaimsgroup.com/?t=101567659700002&r=1&w=2 thread.
The server is a Dual Xeon 1GHz/1GB (highmem enabled), running 2.4.18 w/XFS
(the 2.4.18-xfs-pr2 kernel). It has a large (500GB) partition on LVM (on
hardware raid 5, it's only on LVM to try to get snapshots down the track)
exported to a number of Solaris clients. I have tried kernels from the
SGI 2.4.9-31 one, through various CVS versions, up to the 2.4.18 PR2, and
all display the same behaviour.
Basically, it seems that the server returns 'stale file handles' for files
that most definitely exist and aren't being touched in any way. A few
seconds later, the stale handles are gone. The server is fairly heavily
loaded during the day, and the problem occurs less overnight when there is
low load. It does not make a difference if I run a UP or SMP kernel.
At http://www.itee.uq.edu.au/~chrisp/NFS/, I have put the snoop output for
a few transactions that demonstrate the problem.
'bad_nfs.txt' is 'snoop -V' output which shows a few NFS transactions that
have gone to the NFS server, showing that a handle gets reported as stale
but is found immediately after by another 'find' run. (tcpdump on the
server itself doesn't report fsid/inode information..)
You can see FH=9543 (as Solaris snoop displays it) being returned by the
server as a result of a LOOKUP of the '.microsoft' directory, and then
used for ACCESS checks. The FH is supposed to be able to be used to map
back to an inode on the disk on a subsequent request.
Some time later (I usually wait between 600 and 1200 seconds between
runs), I run a find again, and it does a GETATTR3 on that handle to check
the current file attributes. At this time one sees the server report an
Illegal (stale) NFS handle back to the client, and the client forgets
about the handle.
A few seconds after this occurs, if I run another find, the client does
yet another LOOKUP3 to find it again (as it's forgotten about the handle
as a result of the stale handle), and it's successfully found, so the
client succeeds again.
In my find test, all the stale handles are returned for directories that I
have no access to the contents of; however we have seen stale handles
returned for binaries that are most definitely accessible to users (like
window managers, etc), and mail folders, etc.
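
As a rough illustration of the find test described above, here is a minimal client-side sketch that walks an NFS-mounted tree and records every path whose stat fails with ESTALE. The function name find_stale and the mount point in the usage note are hypothetical and not part of the original report; the report itself used 'find' on Solaris clients.

```python
import errno
import os


def find_stale(root):
    """Walk root and return the paths that fail with ESTALE."""
    stale = []

    def on_error(err):
        # os.walk reports directory-listing failures through this callback.
        if err.errno == errno.ESTALE:
            stale.append(err.filename)

    for dirpath, dirnames, filenames in os.walk(root, onerror=on_error):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat triggers an attribute fetch without following symlinks.
                os.lstat(path)
            except OSError as err:
                if err.errno == errno.ESTALE:
                    stale.append(path)
    return stale
```

Running find_stale("/mnt/export") (a hypothetical mount point) twice, several minutes apart, and comparing the two lists would roughly mirror the repeated find runs; a path that is stale in one run and fine in the next matches the behaviour reported here.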
Another file, 'bad_nfs2.txt', is a more verbose log of
another handle that plays up. In this case, you can see a LOOKUP3 goes out;
the server responds with a file handle, and later when the Solaris client
tries to GETATTR3 the file again, it fails. Yet the file hasn't gone
anywhere, and another LOOKUP3 on it after a while works fine and returns
the same handle.
It all looks like the file info has fallen out of some kind of cache, and
the subsequent lookup helps it be found again. I'm not familiar enough
with the knfsd code (yet) to try to track this down myself; is there any
obvious debugging that I could/should turn on/add to see what's happening?
Thanks,
Chris
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs
On Friday April 5, [email protected] wrote:
> Hi,
>
> I'm trying to track down problems with NFS exports of a large XFS
> partition that's only become apparent when the server went into production
> (of course it didn't show up in 6 months of stability testing!).
Isn't that always the way...
>
> In my find test, all the stale handles are returned for directories that I
> have no access to the contents of;
This seems to suggest that the problem happens when nfsd has to do a
lookup of ".." in a directory as it has to do when you access a
directory that has fallen out of cache (it has to put it back in cache
by finding a full path name).
However the lookup("..") happens below the normal permission checking
so this shouldn't be a problem.... I had a look at a couple of XFS
patches and they don't seem to do any extra permission checking so it
doesn't seem to be that...
> however we have seen stale handles
> returned for binaries that are most definitely accessible to users (like
> window managers, etc), and mail folders, etc.
If the problem is related to building a path back to the root, then
you can remove the problem for non-directories by exporting with
"no_subtree_check".
NeilBrown
Hi Neil,
> However the lookup("..") happens below the normal permission checking
> so this shouldn't be a problem.... I had a look at a couple of XFS
> patches and they don't seem to do any extra permission checking so it
> doesn't seem to be that...
Yes, this is exactly the case. Yesterday I tracked down where it happens
(XFS does some permission checks in xfs_lookup, and that causes
the ".." lookup to fail) and let the XFS people know; they're looking into
it.
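
The failure mode discussed in this thread can be sketched as a toy model. This is not knfsd or XFS code, and every name in it (ToyServer, StaleHandle, and so on) is hypothetical. It only assumes what is said above: a directory handle that has fallen out of cache is reconnected by looking up ".." repeatedly, and a permission check inside that lookup turns into a stale handle.

```python
class StaleHandle(Exception):
    """Raised when a handle cannot be mapped back to a path."""


class ToyServer:
    def __init__(self, parents, searchable, check_perms_in_lookup):
        self.parents = parents              # directory -> parent directory
        self.searchable = searchable        # directories the caller may search
        self.check_perms = check_perms_in_lookup
        self.cache = set()                  # directories currently connected

    def lookup(self, directory):
        # A by-name LOOKUP from the parent connects the directory in cache.
        self.cache.add(directory)
        return directory

    def evict(self, directory):
        # Models the directory falling out of cache under load.
        self.cache.discard(directory)

    def _lookup_dotdot(self, directory):
        # Looking up ".." happens inside `directory`, so a permission
        # check here fails for directories the caller cannot search.
        if self.check_perms and directory not in self.searchable:
            raise PermissionError(directory)
        return self.parents[directory]

    def getattr(self, handle):
        # Cached handles resolve directly; others need the path rebuilt.
        if handle not in self.cache:
            current = handle
            try:
                while current != "/":
                    current = self._lookup_dotdot(current)
            except PermissionError:
                raise StaleHandle(handle)
            self.cache.add(handle)
        return handle
```

In this model, with check_perms_in_lookup=True, a getattr on an unsearchable directory works right after lookup, fails with StaleHandle once the entry is evicted, and works again after a fresh lookup. With the check turned off, the eviction is harmless.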
It still doesn't help me figure out why some clients have got stale handles back
for files they should have been able to access though. Once this first
problem is sorted out I'll hopefully be able to increase the S/N ratio and
figure out if there's anything in common between the files the problems are
on (although, it might not be happening at all any more, I've had no
complaints today).
Regards,
Chris