(Well, crud. I screwed up the previous diff and was missing a
close-curly. This version actually compiles...)
This patch fixes some coherence bugs in the NFS "dentry lookup cache".
The NFS dentry lookup cache provides the nfs_force_lookup_revalidate()
call to invalidate all cached dentries associated with an inode. In
general, the NFS client uses the ctime and mtime in the inode to detect
when changes are made on the server.
Therefore, to maintain cache coherence, nfs_force_lookup_revalidate()
must be called whenever the ctime or mtime of a directory inode is
updated with fresh data from the server. There are a few spots in
nfs_wcc_update_inode() where this rule is violated, making it possible
for the lookup cache to return arbitrarily stale data.
This actually bit me in practice. I have an application where a
negative dentry results in -ENOENT for a file that was created 30+
minutes earlier (despite the "noac" mount option). Unfortunately I
cannot share my test case, but I believe the following simple patch is
"obviously correct", and I can confirm that it fixes my issue.
CC: stable <[email protected]>
Signed-off-by: Patrick LoPresti <[email protected]>
---
--- linux-2.6.35/fs/nfs/inode.c.orig	2010-08-01 15:11:14.000000000 -0700
+++ linux-2.6.35/fs/nfs/inode.c	2010-08-11 22:18:30.000000000 -0700
@@ -819,21 +819,29 @@ static void nfs_wcc_update_inode(struct
 			&& (fattr->valid & NFS_ATTR_FATTR_CHANGE)
 			&& nfsi->change_attr == fattr->pre_change_attr) {
 		nfsi->change_attr = fattr->change_attr;
-		if (S_ISDIR(inode->i_mode))
+		if (S_ISDIR(inode->i_mode)) {
 			nfsi->cache_validity |= NFS_INO_INVALID_DATA;
+			nfs_force_lookup_revalidate(inode);
+		}
 	}
 	/* If we have atomic WCC data, we may update some attributes */
 	if ((fattr->valid & NFS_ATTR_FATTR_PRECTIME)
 			&& (fattr->valid & NFS_ATTR_FATTR_CTIME)
-			&& timespec_equal(&inode->i_ctime, &fattr->pre_ctime))
-		memcpy(&inode->i_ctime, &fattr->ctime, sizeof(inode->i_ctime));
+			&& timespec_equal(&inode->i_ctime,
+					  &fattr->pre_ctime)) {
+		if (S_ISDIR(inode->i_mode))
+			nfs_force_lookup_revalidate(inode);
+		memcpy(&inode->i_ctime, &fattr->ctime, sizeof(inode->i_ctime));
+	}
 	if ((fattr->valid & NFS_ATTR_FATTR_PREMTIME)
 			&& (fattr->valid & NFS_ATTR_FATTR_MTIME)
 			&& timespec_equal(&inode->i_mtime, &fattr->pre_mtime)) {
 		memcpy(&inode->i_mtime, &fattr->mtime, sizeof(inode->i_mtime));
-		if (S_ISDIR(inode->i_mode))
+		if (S_ISDIR(inode->i_mode)) {
 			nfsi->cache_validity |= NFS_INO_INVALID_DATA;
+			nfs_force_lookup_revalidate(inode);
+		}
 	}
 	if ((fattr->valid & NFS_ATTR_FATTR_PRESIZE)
 		&& (fattr->valid & NFS_ATTR_FATTR_SIZE)
On Thu, Aug 12, 2010 at 8:49 AM, Trond Myklebust
<[email protected]> wrote:
>
> This looks less than obviously correct to me.
I just meant my patch obviously fixes a bug even if you do not believe
it is the bug I am encountering. I claim it is "obvious" based on:
Obvious statement #1: It is a bug if a directory inode's mtime is
EVER updated without invalidating the lookup cache, because that
results in a stale cache that does not know it is stale and can
persist indefinitely.
Obvious statement #2: The current nfs_wcc_update_inode() code updates
the directory inode's mtime without invalidating the lookup cache.
So the current code would be wrong even if it never created a problem
in practice. (Although it does create a problem in practice).
> The wcc case is invoked when the ctime/mtime/.... change is known to have occurred due to a file
> creation/unlink/... from this client. It is a weak cache consistency case.
>
> If your client is seeing ENOENT after it created the file itself, then
> the problem isn't cache coherency, but a bug in the file creation code.
The client did not create the file itself.
The sequence of operations goes like this:
1) Client looks up non-existent file, creates negative dentry.
2) File is created by a process on the server.
3) Something causes the client to update its mtime for the directory
without invalidating the dentry lookup cache. (I have not figured out
exactly what, but I have "caught it in the act"; see below).
4) Client incorrectly returns ENOENT when application next attempts to
access file.
You may think this is the usual "multiple updates within one second"
issue. That is what I thought at first, anyway. But I know it is
not, for three reasons...
First, my server is using XFS, which supports sub-second timestamps.
Second, I instrumented nfs_update_inode() to record the inode's mtime
and cache_change_attribute on function entry, then to log a message on
function return if the inode is a directory whose mtime got updated
without also updating cache_change_attribute. Then I reproduced my
issue. My new log message appeared precisely in those runs where the
issue occurred. (I also instrumented the code to follow the actual
sequence of events; I can see that the faulty ENOENT is being returned
because the inode's cache_change_attribute matches the stale dentry's
d_time.)
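For reference, the instrumentation looked roughly like this (a
reconstruction from memory, not my exact debug patch;
__nfs_update_inode is a hypothetical rename of the original function
body):

/* Reconstructed debug instrumentation (not the exact code I ran):
 * snapshot the mtime and cache_change_attribute on entry, and warn
 * on return if a directory's mtime moved without the change
 * attribute moving -- i.e. the lookup cache was left stale. */
static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
{
	struct nfs_inode *nfsi = NFS_I(inode);
	struct timespec old_mtime = inode->i_mtime;
	unsigned long old_cca = nfsi->cache_change_attribute;
	int status;

	status = __nfs_update_inode(inode, fattr); /* original body, renamed */

	if (S_ISDIR(inode->i_mode)
	    && !timespec_equal(&inode->i_mtime, &old_mtime)
	    && nfsi->cache_change_attribute == old_cca)
		printk(KERN_WARNING
		       "NFS: dir %lu mtime updated, lookup cache not invalidated\n",
		       inode->i_ino);
	return status;
}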
Third, my testing confirms that the issue disappears after I apply my
patch. If the problem were the usual "multiple updates in one
second", or indeed any server-side issue whatsoever, my patch could
not have fixed it.
Unfortunately, the only test case I have is my real-world application,
which I cannot share. It takes ~30 minutes of running to reproduce
every time, and it only happens on maybe 2/3 of the runs, so it has
taken me over a week to track this down. I still do not know the
exact sequence of operations that causes the problem; I just know it
really is happening.
In summary, the current code is "obviously wrong" because it violates
its own invariants by potentially updating a directory's mtime without
invalidating its lookup cache. And it really is happening to me in
practice. My patch fixes both the potential problem (which you can
see for yourself) and the actual problem (for which you have only my
word).
Are you unwilling to fix such a bug unless I can provide you a test case?
Thanks!
- Pat
On Wed, 2010-08-11 at 22:26 -0700, Patrick J. LoPresti wrote:
> (Well, crud. I screwed up the previous diff and was missing a
> close-curly. This version actually compiles...)
>
> This patch fixes some coherence bugs in the NFS "dentry lookup cache".
>
> The NFS dentry lookup cache provides the nfs_force_lookup_revalidate()
> call to invalidate all cached dentries associated with an inode. In
> general, the NFS client uses the ctime and mtime in the inode to detect
> when changes are made on the server.
>
> Therefore, to maintain cache coherence, nfs_force_lookup_revalidate()
> must be called whenever the ctime or mtime of a directory inode is
> updated with fresh data from the server. There are a few spots in
> nfs_wcc_update_inode() where this rule is violated, making it possible
> for the lookup cache to return arbitrarily stale data.
>
> This actually bit me in practice. I have an application where a
> negative dentry results in -ENOENT for a file that was created 30+
> minutes earlier (despite the "noac" mount option). Unfortunately I
> cannot share my test case, but I believe the following simple patch is
> "obviously correct", and I can confirm that it fixes my issue.
This looks less than obviously correct to me. The wcc case is invoked
when the ctime/mtime/.... change is known to have occurred due to a file
creation/unlink/... from this client. It is a weak cache consistency
case.
If your client is seeing ENOENT after it created the file itself, then
the problem isn't cache coherency, but a bug in the file creation code.
Cheers
Trond
On Fri, Aug 13, 2010 at 5:36 AM, J. Bruce Fields <[email protected]> wrote:
>
> If you have an easy reproducer it might also be worth experimenting with
> NFSv4 exports of an ext4 system mounted with the i_version option.
I may give that a try.
> Actually: should the NFSv4 server always be using i_version as the
> change attribute for directories? (Does every exportable filesystem
> update it on every directory modification?)
It appears to be an option for ext4. I do not see it referenced in XFS.
> (And if so, maybe on directories we should factor the i_version into the
> low bits of the mtime reported to NFSv3 client?)
Probably a bad idea. It runs the risk of "change file A, then change
file B" resulting in backwards timestamps, which is even worse than
4-millisecond timestamps.
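To make the hazard concrete, here is a hypothetical encoding (my own
numbers, assuming HZ=250 and therefore a 4,000,000 ns tick):

/* Hypothetical encoding of the suggestion: fold i_version into the
 * sub-tick bits of the directory mtime.  Two successive changes in
 * the same tick can then report a *decreasing* mtime whenever the
 * low bits of i_version wrap. */
static u64 encode_mtime_ns(u64 mtime_ns, u64 i_version)
{
	return mtime_ns - (mtime_ns % 4000000) + (i_version % 4000000);
}

/*
 * encode_mtime_ns(t, 3999999) -> tick_base + 3999999  (change to A)
 * encode_mtime_ns(t, 4000000) -> tick_base + 0        (later change to B)
 */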
- Pat
On Thu, 2010-08-12 at 10:13 -0700, Patrick J. LoPresti wrote:
> On Thu, Aug 12, 2010 at 8:49 AM, Trond Myklebust
> <[email protected]> wrote:
> >
> > This looks less than obviously correct to me.
>
> I just meant my patch obviously fixes a bug even if you do not believe
> it is the bug I am encountering. I claim it is "obvious" based on:
>
> Obvious statement #1: It is a bug if a directory inode's mtime is
> EVER updated without invalidating the lookup cache, because that
> results in a stale cache that does not know it is stale and can
> persist indefinitely.
Wrong! Not if we _know_ that the mtime was updated due to an action we
took. We don't have to invalidate the lookup cache every time we create
a new dentry: we're quite able to add that dentry in to the cache
ourselves, and we do that.
> Obvious statement #2: The current nfs_wcc_update_inode() code updates
> the directory inode's mtime without invalidating the lookup cache.
>
> So the current code would be wrong even if it never created a problem
> in practice. (Although it does create a problem in practice).
The current code is not wrong.
> > The wcc case is invoked when the ctime/mtime/.... change is known to have occurred due to a file
> > creation/unlink/... from this client. It is a weak cache consistency case.
> >
> > If your client is seeing ENOENT after it created the file itself, then
> > the problem isn't cache coherency, but a bug in the file creation code.
>
> The client did not create the file itself.
So why is it going through the wcc case, and why does wcc think that the
client created the file?
> The sequence of operations goes like this:
>
> 1) Client looks up non-existent file, creates negative dentry.
> 2) File is created by a process on the server.
> 3) Something causes the client to update its mtime for the directory
> without invalidating the dentry lookup cache. (I have not figured out
> exactly what, but I have "caught it in the act"; see below).
Changing the wcc behaviour won't fix this.
> 4) Client incorrectly returns ENOENT when application next attempts to
> access file.
>
> You may think this is the usual "multiple updates within one second"
> issue. That is what I thought at first, anyway. But I know it is
> not, for three reasons...
>
> First, my server is using XFS, which supports sub-second timestamps.
>
> Second, I instrumented nfs_update_inode() to record the inode's mtime
> and cache_change_attribute on function entry, then to log a message on
> function return if the inode is a directory whose mtime got updated
> without also updating cache_change_attribute. Then I reproduced my
> issue. My new log message appeared precisely in those runs where the
> issue occurred. (I also instrumented the code to follow the actual
> sequence of events; I can see that the faulty ENOENT is being returned
> because the inode's cache_change_attribute matches the stale dentry's
> d_time.)
>
> Third, my testing confirms that the issue disappears after I apply my
> patch. If the problem were the usual "multiple updates in one
> second", or indeed any server-side issue whatsoever, my patch could
> not have fixed it.
Then you need to explain why. As I said, the wcc code path should not be
triggered when the client isn't creating or deleting stuff in the
directory. The WCC is there to tell you that your client and only your
client has changed stuff in the directory since it last revalidated the
mtime.
> Unfortunately, the only test case I have is my real-world application,
> which I cannot share. It takes ~30 minutes of running to reproduce
> every time, and it only happens on maybe 2/3 of the runs, so it has
> taken me over a week to track this down. I still do not know the
> exact sequence of operations that causes the problem; I just know it
> really is happening.
>
> In summary, the current code is "obviously wrong" because it violates
> its own invariants by potentially updating a directory's mtime without
> invalidating its lookup cache. And it really is happening to me in
> practice. My patch fixes both the potential problem (which you can
> see for yourself) and the actual problem (for which you have only my
> word).
I disagree.
> Are you unwilling to fix such a bug unless I can provide you a test case?
I can't take the patch as it stands. It is obviously wrong whether or
not it fixes your test case.
I'm happy to accept that there may be a bug, but you're going to have to
investigate further what is happening, and figure out why changing the
WCC code appears to fix the situation.
My hunch is that you are seeing a server bug rather than a client bug
here...
Cheers
Trond
On Thu, Aug 12, 2010 at 10:26 AM, Trond Myklebust
<[email protected]> wrote:
>
> My hunch is that you are seeing a server bug rather than a client bug
> here...
I believe I am close to finding the actual bug, but there is some code
in the NFS server that is confusing me.
Here is linux-2.6.35/fs/nfsd/vfs.c, lines 212-227 (in function
nfsd_lookup_dentry):
	} else {
		fh_lock(fhp);
		dentry = lookup_one_len(name, dparent, len);
		host_err = PTR_ERR(dentry);
		if (IS_ERR(dentry))
			goto out_nfserr;
		/*
		 * check if we have crossed a mount point ...
		 */
		if (nfsd_mountpoint(dentry, exp)) {
			if ((host_err = nfsd_cross_mnt(rqstp, &dentry, &exp))) {
				dput(dentry);
				goto out_nfserr;
			}
		}
	}
Shouldn't there be a call to "fh_unlock()" to match the call to
"fh_lock()"? Can anyone tell me where the corresponding fh_unlock()
call appears, or if it does not appear, why it is not needed?
Thanks...
- Pat
OK, I found the problem. Whether it is a bug depends on your point of
view, I suppose.
Although I am using XFS on my file server, and XFS has
nanosecond-granularity timestamps, the true granularity of ctime/mtime
is ultimately determined by the resolution of current_fs_time(), which
calls current_kernel_time(); i.e. jiffies; i.e. 1/HZ.
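(From memory, the 2.6.35 implementation is approximately this; I am
paraphrasing, not quoting:)

/* kernel/time.c, approximately: the time source is the tick-updated
 * xtime, so the effective resolution is 1/HZ no matter how fine a
 * granularity the filesystem requests via s_time_gran. */
struct timespec current_fs_time(struct super_block *sb)
{
	struct timespec now = current_kernel_time();
	return timespec_trunc(now, sb->s_time_gran);
}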
On my system (SLES 11 SP1), HZ is 250. In my failing application, 4
ms is long enough for many filesystem operations, even over NFS. (My
network is 10GigE with a 300 microsecond round trip time, and my
systems are very new.)
Anyway, I instrumented the VFS code on the NFS server to catch it in
the act; specifically, I saw the following sequence:
file A is created on server, updating directory mtime
NFS client does LOOKUP on file B, gets nfserr_noent
file B created on server, does not update directory's mtime
...all within 4 milliseconds (which is why the creation of file B did
not update the directory's mtime).
The result is that the lookup cache on the client is stale and stays
stale until some other client (or the server) updates the directory.
Even making changes from the client does not invalidate the cache,
thanks to the clever WCC logic that Trond had to explain to me
earlier.
This is not exactly an NFS-specific question, but I will ask anyway...
If I were to propose modifying current_fs_time() to call
getnstimeofday() instead of current_kernel_time(), would the VFS folks
laugh me out the door?
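Concretely, I am thinking of something like this (untested sketch):

/* Untested sketch of the proposal: read the clocksource via
 * getnstimeofday() instead of the tick-based current_kernel_time(),
 * so that file timestamps get the clocksource's resolution. */
struct timespec current_fs_time(struct super_block *sb)
{
	struct timespec now;

	getnstimeofday(&now);
	return timespec_trunc(now, sb->s_time_gran);
}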
1/HZ granularity for file timestamps just seems so... 90s or
something. 4ms really is a lot of time these days.
- Pat
On Thu, Aug 12, 2010 at 10:26 AM, Trond Myklebust
<[email protected]> wrote:
>
> Wrong! Not if we _know_ that the mtime was updated due to an action we
> took. We don't have to invalidate the lookup cache every time we create
> a new dentry: we're quite able to add that dentry in to the cache
> ourselves, and we do that.
OK, now I see. That is the purpose of the "atomic update" checks;
i.e., seeing whether the ctime/mtime on the inode equals the
pre_ctime/pre_mtime in the fattr.
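Schematically (paraphrasing the existing nfs_wcc_update_inode() code):

/* Weak cache consistency, schematically: only if the server's pre-op
 * mtime matches what we have cached can we conclude that *we* caused
 * the change, and accept the post-op mtime without invalidating. */
if ((fattr->valid & NFS_ATTR_FATTR_PREMTIME)
    && (fattr->valid & NFS_ATTR_FATTR_MTIME)
    && timespec_equal(&inode->i_mtime, &fattr->pre_mtime))
	memcpy(&inode->i_mtime, &fattr->mtime, sizeof(inode->i_mtime));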
> I'm happy to accept that there may be a bug, but you're going to have to
> investigate further what is happening, and figure out why changing the
> WCC code appears to fix the situation.
Well, I know why my change fixes it: because that code path updates
the mtime in the inode to a value that matches the mtime on the
server, even though the dentry lookup cache is actually out of date.
However, it could have become out of date much earlier... And then
subsequent operations from the client "know" they are the ones
updating the mtime, thus preserving the stale cache indefinitely.
In other words, once my lookup cache gets into this bad state, it will
stay that way until some other client (or the server) updates the
directory. My patch flushes the cache even for operations that
originate on the client itself, thus working around the bug without
fixing it.
> My hunch is that you are seeing a server bug rather than a client bug
> here...
Yeah, assuming the "atomic action" logic is correct, I agree.
This also explains why the problem is so hard to reproduce. In my
application, the client checks for the existence of the file at almost
exactly the same time it is being created on the server. This may
well be triggering a race in the server that violates the atomicity
guarantees of NFS WCC. And once the cache becomes stale on my client,
it stays that way in spite of additional client-side directory
modifications.
Thank you for the quick replies. Obviously I need to investigate further.
- Pat
On Thu, Aug 12, 2010 at 10:16:57PM -0700, Patrick J. LoPresti wrote:
> OK, I found the problem. Whether it is a bug depends on your point of
> view, I suppose.
>
> Although I am using XFS on my file server, and XFS has
> nanosecond-granularity timestamps, the true granularity of ctime/mtime
> is ultimately determined by the resolution of current_fs_time() which
> calls current_kernel_time(); i.e. jiffies; i.e. 1/HZ.
>
> On my system (SLES 11 SP1), HZ is 250. In my failing application, 4
> ms is long enough for many filesystem operations, even over NFS. (My
> network is 10GigE with a 300 microsecond round trip time, and my
> systems are very new.)
>
> Anyway, I instrumented the VFS code on the NFS server to catch it in
> the act; specifically, I saw the following sequence:
>
> file A is created on server, updating directory mtime
> NFS client does LOOKUP on file B, gets nfserr_noent
> file B created on server, does not update directory's mtime
>
> ...all within 4 milliseconds (which is why the creation of file B did
> not update the directory's mtime).
>
> The result is that the lookup cache on the client is stale and stays
> stale until some other client (or the server) updates the directory.
> Even making changes from the client does not invalidate the cache,
> thanks to the clever WCC logic that Trond had to explain to me
> earlier.
>
> This is not exactly an NFS-specific question, but I will ask anyway...
> If I were to propose modifying current_fs_time() to call
> getnstimeofday() instead of current_kernel_time(), would the VFS folks
> laugh me out the door?
Good question.... Cc'ing linux-fsdevel.
If you have an easy reproducer it might also be worth experimenting with
NFSv4 exports of an ext4 system mounted with the i_version option.
Actually: should the NFSv4 server always be using i_version as the
change attribute for directories? (Does every exportable filesystem
update it on every directory modification?)
(And if so, maybe on directories we should factor the i_version into the
low bits of the mtime reported to NFSv3 client?)
--b.
>
> 1/HZ granularity for file timestamps just seems so... 90s or
> something. 4ms really is a lot of time these days.
>
> - Pat