2014-07-02 01:05:34

by NeilBrown

Subject: NFSv4 open sequencing can lead to incorrect ESTALE


Hi,
The bash script below demonstrates a problem with NFSv4 - maybe two.
I'm testing on 3.16-rc2.

It mounts a local filesystem via NFS. The main thread then replaces
a particular file on the local file system and then accesses it via NFS,
checking that the contents are correct. It does this repeatedly.

In parallel a number of threads repeatedly open/close the same file (using
'grep' rather than 'cat' so tracing in the kernel can see which is which).

With NFSv4, none of these should ever get ESTALE. The server holds the
file open while the client has it open so even if it is unlinked in the
local filesystem, the client will still be able to access it.
Yet we do get ESTALE errors.

These are caused by the may_open() call in do_last().
may_open() calls inode_permission(), which calls nfs_permission(), which
performs an ACCESS call over NFS and can get ESTALE. This error stops
do_last() from calling finish_open() which, in the NFSv4 case, does the
final lookup and would, for this test, find the correct, non-stale inode.

When may_open() and then do_last() return ESTALE, do_filp_open() will call
path_openat() again, this time with LOOKUP_REVAL, but that doesn't help:
nfs_permission() is only given the inode, so it cannot d_drop() the dentry
or otherwise trigger a revalidation.
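
To make the sequencing concrete, the relevant part of do_last() on 3.16
looks roughly like this (condensed from fs/namei.c and paraphrased from
memory - a sketch of the shape of the problem, not compilable code):

	/* the name was already found in the dcache */
	error = may_open(&nd->path, acc_mode, open_flag);
	/* -> inode_permission() -> nfs_permission(), which may issue an
	 * NFS ACCESS call against the cached (possibly stale) filehandle
	 * and so can fail with -ESTALE */
	if (error)
		goto out;		/* finish_open() never runs */

	error = finish_open(file, nd->path.dentry, NULL, opened);
	/* -> nfs4_file_open() -> the real NFSv4 OPEN, which would have
	 * looked the name up again and found the current, non-stale file */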

I added a d_drop() to may_open() when inode_permission() returns ESTALE
(sketched below) and the symptom went away, but I doubt that is the right
thing to do. For NFSv4 I think it is really best to leave all the work to
the 'open' call and not perform any access tests beforehand. All the access
tests should happen inside the open call, once the server knows that the
file is 'open'.
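
The band-aid I tried looks roughly like this (3.16-era may_open() in
fs/namei.c, trimmed to the relevant lines - only a sketch of the
experiment, not a proposed patch):

static int may_open(struct path *path, int acc_mode, int flag)
{
	struct dentry *dentry = path->dentry;
	struct inode *inode = dentry->d_inode;
	int error;

	/* ... type and flag checks omitted ... */

	error = inode_permission(inode, acc_mode);
	if (error == -ESTALE)
		/* make the LOOKUP_REVAL retry in do_filp_open() actually
		 * look the name up again instead of reusing the stale dentry */
		d_drop(dentry);
	return error;
}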

I have no suggestion for how to fix this properly.


Once that bug is fixed, the script still shows unexpected behaviour. It
will eventually report that the file seen over NFS has the old content
instead of the new content.

This happens because the file (which has been unlinked on the server via a
rename) is still open by one of the background threads and so
can_open_cached() reports that "cat" doesn't need to actually open the file
- it can re-use the open that the 'grep' has.
This seems a little odd. It is a bit like treating an active 'open' of a
file as a mini-delegation: you don't need to open it again. However, if it
were a real delegation, it would be lost when the file was unlinked on the
server.
I tried running with "lookupcache=none" but I still get the same errors.
That certainly seems like a bug: with lookupcache=none, opening a file
should check the name on the server, not assume that the name cached on the
client is correct. But that isn't what happens.

I have no idea how to fix this one either. I'm not even 100% sure which bit
is the bug, but something definitely seems wrong.
I changed can_open_cached() to always return 0 and the problem went away,
but again I don't think that is a correct fix.
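
For reference, that hack was simply (fs/nfs/nfs4proc.c; the signature here
is approximate, from memory - a diagnostic sketch only, definitely not a
fix):

static int can_open_cached(struct nfs4_state *state, fmode_t mode, int open_mode)
{
	/* never reuse an existing open stateid; force a real OPEN RPC
	 * for every open(2) */
	return 0;
}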

Thanks,
NeilBrown


#!/bin/bash
# usage: bash <this script> [count] [local-export-dir] [nfs-mountpoint] [max-errors]
# Must run as root on the NFS server host, since it mounts localhost:$local.

cnt=${1-10000}
local=${2-/export}
nfs=${3-/mnt}
max_errs=${4-1}

echo "using: $cnt $local $nfs $max_errs"
mount -o vers=4,lookupcache=none localhost:$local $nfs || { echo mount failed ; exit 1; }

rm -f $nfs/afile
touch $nfs/afile

# Background readers: repeatedly open/close the file over NFS while it exists.
for i in {1..5}; do
    while [ -f $nfs/afile ]; do grep . $nfs/afile > /dev/null 2>&1 ; done &
done

i=0
err=0
while [ $i -lt $cnt ]; do
    # Write new content locally, rename it over the old file, then read back over NFS.
    mydate=$(date +%s.%N)
    want="$i $mydate"
    echo $want > $local/bfile
    mv $local/bfile $local/afile
    have=$(cat $nfs/afile)
    if [ "$want" == "$have" ]; then
        echo -n -e "$want\r"
    else
        echo fail > /dev/kmsg
        echo "Wanted $want have $have."
        # See whether the NFS view of the file eventually catches up.
        for x in {1..1000}; do
            sleep 0.1
            grep "$mydate" $nfs/afile && { echo File now correct; break;}
        done
        let err=$err+1
        if [ $err -ge $max_errs ]; then
            echo
            echo $err failures after $i attempts
            rm $local/afile
            umount $nfs
            exit 1
        fi
    fi
    let i=$i+1
done
rm $local/afile
echo
echo $err failures in $i attempts
sleep 2
umount $nfs
exit 0

