We've had a couple of customers complain about a performance regression
in a newer kernel on one of our older products. After some work we've
tracked it down to the fact that while fixing some cache consistency
problems, we erred on the side of too many cache invalidations. The
client is being too aggressive about invalidating its caches...
We have a patch in mainline kernels that adds the
nfs_post_op_update_inode_force_wcc() function to fake pre_op_attrs on
write operations. This works fairly well when the server doesn't return
pre_op_attrs. What I've found though is that when servers do return
pre_op_attrs on write calls, that they often return to the client poorly
ordered. This fools the client into thinking that there are other
writers on the file, even though it's the only one.
The following two patches attempt to fix this by:
1) faking pre_op_attrs on writes even when the server has actually
returned a valid set
2) not allowing post_op_attrs to update mtimes backward when we are
faking pre_op_attrs
I've tested this for performance with a simple iozone test on NFSv3. On
the client, mounting an older NetApp server, I'm running:
# time /opt/iozone/bin/iozone -ac -g 64M
Without these patches:
real 46m6.959s
user 0m1.509s
sys 23m40.880s
Client nfs v3:
null getattr setattr lookup access readlink
0 0% 1867 0% 198 0% 199 0% 319 0% 0 0%
read write create mkdir symlink mknod
83232 29% 195633 69% 198 0% 0 0% 0 0% 0 0%
remove rmdir rename link readdir readdirplus
198 0% 0 0% 0 0% 0 0% 0 0% 2 0%
fsstat fsinfo pathconf commit
0 0% 2 0% 0 0% 0 0%
...with these patches:
real 33m57.375s
user 0m1.471s
sys 12m59.619s
Client nfs v3:
null getattr setattr lookup access readlink
0 0% 1802 0% 198 0% 201 0% 206 0% 0 0%
read write create mkdir symlink mknod
0 0% 195836 98% 198 0% 0 0% 0 0% 0 0%
remove rmdir rename link readdir readdirplus
198 0% 0 0% 0 0% 0 0% 0 0% 0 0%
fsstat fsinfo pathconf commit
0 0% 2 0% 0 0% 0 0%
...no read calls on the second set, indicating that the patches drop
the number of cache invalidations to 0. Before that, reads accounted
for 29% of the calls and added over 12 mins to the run time.
I've also done a bit of light regression testing and haven't noticed
any major problems.
I suppose that this technically violates the RFC (esp patch 1), but
given that we're already faking pre_op_attrs when we don't have them,
is there any real harm in always doing this?
Thoughts?
Signed-off-by: Jeff Layton <[email protected]>
Write replies don't necessarily come in the same order that they are
committed to disk. It's easily possible for the wcc pre_op_attrs to not
match the current mtime on the file, even if there are no other writers.
When this happens, it can cause the pre_op_attr mtime to bounce around
which can confuse the client revalidation logic and make it invalidate
the cache, even when it's clearly the only writer to the file.
This patch attempts to reduce the amount of cache invalidations due to
post-op-attr updates by having post_op_update_inode_force_wcc() always
fake up wcc pre_op_attrs on write replies, even when the server has sent
us valid ones.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/inode.c | 6 ++----
1 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index a4c7cf2..e5dbe75 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -954,13 +954,11 @@ int nfs_post_op_update_inode(struct inode *inode, struct nfs_fattr *fattr)
*/
int nfs_post_op_update_inode_force_wcc(struct inode *inode, struct nfs_fattr *fattr)
{
- if ((fattr->valid & NFS_ATTR_FATTR_V4) != 0 &&
- (fattr->valid & NFS_ATTR_WCC_V4) == 0) {
+ if ((fattr->valid & NFS_ATTR_FATTR_V4) != 0) {
fattr->pre_change_attr = NFS_I(inode)->change_attr;
fattr->valid |= NFS_ATTR_WCC_V4;
}
- if ((fattr->valid & NFS_ATTR_FATTR) != 0 &&
- (fattr->valid & NFS_ATTR_WCC) == 0) {
+ if ((fattr->valid & NFS_ATTR_FATTR) != 0) {
memcpy(&fattr->pre_ctime, &inode->i_ctime, sizeof(fattr->pre_ctime));
memcpy(&fattr->pre_mtime, &inode->i_mtime, sizeof(fattr->pre_mtime));
fattr->pre_size = inode->i_size;
--
1.5.3.6
Since write replies don't necessarily come back in the order that they
were committed to disk, the post_op_attr mtime can sometimes jump
forward and backward. When this happens, it can make the client
invalidate the cache, even in situations when it's clear there are
no other writers.
In the situation where we're faking up pre_op_attrs, take extra care
to not allow the post_op_attrs to roll the mtime backward. If we get
a fattr with a mtime that's older the one currently on the inode,
assume that we've already previously updated the mtime and the one
currently on the inode is the correct one.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/inode.c | 13 ++++++++++---
include/linux/nfs_xdr.h | 1 +
2 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index e5dbe75..c502494 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -816,7 +816,14 @@ static void nfs_wcc_update_inode(struct inode *inode, struct nfs_fattr *fattr)
if (timespec_equal(&inode->i_ctime, &fattr->pre_ctime))
memcpy(&inode->i_ctime, &fattr->ctime, sizeof(inode->i_ctime));
if (timespec_equal(&inode->i_mtime, &fattr->pre_mtime)) {
- memcpy(&inode->i_mtime, &fattr->mtime, sizeof(inode->i_mtime));
+ if ((fattr->valid & NFS_ATTR_WCC_FAKE) &&
+ timespec_compare(&inode->i_mtime,
+ &fattr->mtime) > 0)
+ memcpy(&fattr->mtime, &inode->i_mtime,
+ sizeof(fattr->mtime));
+ else
+ memcpy(&inode->i_mtime, &fattr->mtime,
+ sizeof(inode->i_mtime));
if (S_ISDIR(inode->i_mode))
nfsi->cache_validity |= NFS_INO_INVALID_DATA;
}
@@ -956,13 +963,13 @@ int nfs_post_op_update_inode_force_wcc(struct inode *inode, struct nfs_fattr *fa
{
if ((fattr->valid & NFS_ATTR_FATTR_V4) != 0) {
fattr->pre_change_attr = NFS_I(inode)->change_attr;
- fattr->valid |= NFS_ATTR_WCC_V4;
+ fattr->valid |= (NFS_ATTR_WCC_V4|NFS_ATTR_WCC_FAKE);
}
if ((fattr->valid & NFS_ATTR_FATTR) != 0) {
memcpy(&fattr->pre_ctime, &inode->i_ctime, sizeof(fattr->pre_ctime));
memcpy(&fattr->pre_mtime, &inode->i_mtime, sizeof(fattr->pre_mtime));
fattr->pre_size = inode->i_size;
- fattr->valid |= NFS_ATTR_WCC;
+ fattr->valid |= (NFS_ATTR_WCC|NFS_ATTR_WCC_FAKE);
}
return nfs_post_op_update_inode(inode, fattr);
}
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index f301d0b..6715a67 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -64,6 +64,7 @@ struct nfs_fattr {
#define NFS_ATTR_FATTR_V4 0x0008 /* NFSv4 change attribute */
#define NFS_ATTR_WCC_V4 0x0010 /* pre-op change attribute */
#define NFS_ATTR_FATTR_V4_REFERRAL 0x0020 /* NFSv4 referral */
+#define NFS_ATTR_WCC_FAKE 0x0040 /* pre-op attrs are faked */
/*
* Info on the file system
--
1.5.3.6