In testing, we're seeing a problem where an ls of a directory over NFSv3
reports "File not found" for one of the sub-directories. With some
tracing turned on, I can see that there is an -EBADCOOKIE error returned
from find_dirent(). The error is persistent; it doesn't go away with
time. I do have one report that says that by adding another file to the
directory, the error went away.
The directory structure is something like the following:
/mnt/lun43 is the mount point for the NFS share. /mnt/lun43/storage is
the working directory when ls is run. The "blobs" directory is listed
fine. However, the pool-0 directory reported "File not found", and
apparently has a bad cookie.
We'd be very grateful if someone with more knowledge of the Linux NFS
client could take a look at this. This problem has been reproducible
(it seems to elude me, but my colleagues can cause it somewhat readily).
I captured the network traffic from the ls command using tcpdump and
analyzed it with wireshark. First, there is an ACCESS RPC for
/mnt/lun43/storage. Then there is a GETATTR RPC for /mnt/lun43, followed
by one for /mnt/lun43/storage. That's it; the rest must be pulled from the
client's cache, including the problematic cookie.
I have some debugging output available. http://n01se.net/paste/zDk is
the NFS debug output from running ls. inode #2 is the /mnt/lun43 mount
point. inode #48001 is the /mnt/lun43/storage directory.
I also have NFS and RPC debug output from when the problem was
reproduced on another client (and another server).
http://n01se.net/paste/qUc is the output from stat-ing the
storage/blobs/ directory under the mount point (/mnt/lun<something>/),
which behaves normally. inode #2 is the mount point. I believe inode
#80001 is the storage/ directory and inode #80002 is the storage/blobs/
directory. http://n01se.net/paste/rdL is the output from stat-ing the
storage/pool-0/ directory, which reports an error.
We do have one bit of speculation as to what could be causing the
problem, but we're newbies to the Linux NFS client code. What would
happen if the /mnt/lun43/storage/blobs directory was created, then the
client started reading the /mnt/lun43/storage/ directory, then the
/mnt/lun43/storage/pool-0 directory was created, all within the same
second (we are using ext3 on the server, so our mtime granularity is
only one second)? Could that situation cause any problems?
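
To make the speculation concrete, here is a minimal sketch in Python. It
assumes a toy client that revalidates its cached directory listing purely
by comparing the directory mtime, and a toy server that truncates mtime to
whole seconds. It is not the Linux NFS client code; ServerDir, ClientCache
and simulate are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ServerDir:
    """Toy server directory; mtime is kept in whole seconds only."""
    entries: List[str] = field(default_factory=list)
    mtime: int = 0

    def create(self, name: str, now: float) -> None:
        self.entries.append(name)
        self.mtime = int(now)  # the sub-second part is discarded


@dataclass
class ClientCache:
    """Toy client; trusts its cached listing while mtime is unchanged."""
    entries: Optional[List[str]] = None
    mtime: Optional[int] = None

    def listdir(self, server: ServerDir) -> List[str]:
        # Re-read from the server only if the directory mtime differs
        # from the one recorded when the cache was filled.
        if self.entries is None or self.mtime != server.mtime:
            self.entries = list(server.entries)
            self.mtime = server.mtime
        return list(self.entries)


def simulate():
    server = ServerDir()
    client = ClientCache()

    server.create("blobs", now=100.1)
    first = client.listdir(server)      # cache filled during second 100

    server.create("pool-0", now=100.7)  # same second, mtime stays 100
    second = client.listdir(server)     # cache looks valid, pool-0 missed

    server.create("extra", now=105.0)   # later second, mtime changes
    third = client.listdir(server)      # cache is refreshed
    return first, second, third


if __name__ == "__main__":
    for listing in simulate():
        print(listing)
```

In this toy model the second listing never sees pool-0, and it only shows
up once a later change moves the directory mtime forward, which would be
consistent with the report that adding another file made the error go away.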
I probably should have noted much earlier that we are using SLES10 SP1
x86_64, which is based on a 2.6.16 kernel.
On Thu, Jan 10, 2008 at 11:59:51PM -0500, Bob Bell wrote:
>In testing, we're seeing a problem where an ls of a directory over NFSv3
>reports "File not found" for one of the sub-directories. With some
>tracing turned on, I can see that there is an -EBADCOOKIE error returned
>from find_dirent(). The error is persistent; it doesn't go away with
>time. I do have one report that says that by adding another file to the
>directory, the error went away.
My apologies, apparently EBADCOOKIE has nothing to do with this. The
problem is the caching model and ext3's full-second timestamp
granularity. I've already discussed the issue with Trond.