In April I did some research on why synchronous NFSv3 performance on XFS is so
rotten when compared to local filesystem performance. The workload I chose to
work with is tar.
After taking some measurements I came to the conclusion that one of the big
problems is that we're not treating the log as stable storage. By calling
write_inode_now() we've written the changes to the log first and then gone and
also written them out to the inode on disk.
In a short discussion of this issue on the xfs-oss list it was suggested that I
post the patches here for discussion.
The following series is adds a 'wsync' export option to nfsd. It is intended
to be used on XFS with the wsync mount option. When you already have a
synchronous log there is no need to sync metadata separately.
This is barely tested, YMMV, I could have this all wrong, etc, etc. Here are
some very unscientific measurements taken over gigabit ethernet.
# time tar -xvf /mnt2/quilt-0.47.tar > /dev/null
No XFS wsync, no NFS wsync:
0m13.177s 0m13.301s 0m13.528s
XFS wsync set, no NFS wsync:
0m13.019s 0m13.232s 0m13.094s
XFS wsync set, NFS wsync set:
0m8.361s 0m8.400s 0m8.301s
Curious to hear if this is a reasonable thing to do. Suggestions welcome.
Thanks!
Ben
---
Ben Myers (4):
Add 'wsync' export option.
Add datasync argument to nfsd_sync_dir().
If 'wsync' call vfs_fsync() instead of write_inode_now().
If 'wsync' pass datasync=1 to vfs_fsync().
fs/nfsd/export.c | 1 +
fs/nfsd/nfs4recover.c | 2 +
fs/nfsd/vfs.c | 62 +++++++++++++++++++++++++++++++------------
fs/nfsd/vfs.h | 2 +
include/linux/nfsd/export.h | 4 ++-
5 files changed, 50 insertions(+), 21 deletions(-)
On Thu, Feb 04, 2010 at 10:30:06AM -0500, Christoph Hellwig wrote:
> On Wed, Feb 03, 2010 at 05:44:24PM -0600, Ben Myers wrote:
> > The following series is adds a 'wsync' export option to nfsd. It is intended
> > to be used on XFS with the wsync mount option. When you already have a
> > synchronous log there is no need to sync metadata separately.
>
> You don't need the xfs wsync option, as the existing write_inode calls
> or your new fsync calls are doing the same as the wsync mount option,
> just from a higher layer.
Ok, I see that now. write_inode_now is basically a superset of what we
need. The important thing is to force the log (which
xfs_file_operations.xfs_file_fsync does do) but not to call
xfs_super_operations.write_inode (via write_inode_now) which eventually
gets into xfs_iflush and takes forever. We can take forever later-- when
the log is full. ;)
> The wsync option causes the log to be synchronously forced up to the
> log sequence number of the commit for the metadata operation, that is
> make all the operations affected by it synchronous. That's exactly
> what we'll do using fsync (actually right now we force the whole log,
> but I have a patch to optimize it to only force nuntil the last commit
> lsn), and approqimately the same as we do using write_inode, just a
> lot less efficiently.
Rather than having X number of synchronous log transactions written
separately as with wsync you have X number of log transactions written
out together in one go by vfs_fsync (if datasync==0). That should be
faster than wsync.
> > Curious to hear if this is a reasonable thing to do. Suggestions
> > welcome.
>
> I think it's reasonable. What might be even better it to have an
> export operation call out into the filesystem so that we can force
> wsync and not let nfsd deal with it at all. There is a fair chance
> that the filesystem can do the sync more efficiently.
Trond also suggested an export_operation and I think it's a good idea.
I'll explore that and repost.
Thanks,
Ben
On Thu, Feb 04, 2010 at 12:15:29PM -0600, [email protected] wrote:
> Rather than having X number of synchronous log transactions written
> separately as with wsync you have X number of log transactions written
> out together in one go by vfs_fsync (if datasync==0). That should be
> faster than wsync.
Indeed, except that there aren't a lot of different transactions
usually.
- nfsd_setattr is one SETATTR transaction
- nfsd_create might be multiple transactions, indeed - especially
the nfsv3 variant that also adds a setattr transaction
- nfsd_link should be a single one
but yes, doing the log force from nfsd should be a benefit for
the create side at least. The additional benefit is that we can
just drive it from NFSD and don't need to force mount options on
the fs. So yes, let's do it from nfsd.
> Trond also suggested an export_operation and I think it's a good idea.
> I'll explore that and repost.
Indeed. For XFS that export_operation could probably be a lot
simpler than xfs_fsync. I don't think we need to catch the
non-transactional timestamp and size updates at all, and we're
guaranteed the transaction has already commited. So the
method might be as simple as:
static int xfs_nfs_force_inode(struct inode *inode)
{
struct xfs_inode *ip = XFS_I(inode);
xfs_ilock(ip, XFS_ILOCK_SHARED);
if (xfs_ipincount(ip)) {
xfs_lsn_t force_lsn = ip->i_itemp->ili_last_lsn;
ASSERT(force_lsn);
xfs_log_force_lsn(ip->i_mount, force_lsn, XFS_LOG_SYNC);
}
xfs_iunlock(ip, XFS_ILOCK_SHARED);
return 0;
}
On Thu, 2010-02-04 at 13:30 -0500, Christoph Hellwig wrote:
> On Thu, Feb 04, 2010 at 11:20:16AM -0600, [email protected] wrote:
> > > I think this is incorrect. datasync = 1 means to only write out
> > > metadata that is required to access the file data.
> >
> > My thinking there was that the XFS wsync mount option would have ensured
> > that the metadata that changed is already in the log. I wonder if when
> > using the wsync xfs mount option, could one drop the vfs_fsync
> > altogether? Looks to me like O_SYNC would also have taken care of any
> > metadata that is required to access the file data.
>
> Indeed, both write_inode_now and vfs_fsync will also cause data to
> be written. But my understanding of nfsd is that we manage the data
> writeout separately anyway and we care about the metadata here, which
> the placement of these calls would suggest:
>
> - nfsd_setattr for attribute updates
> - nfsd_create for creating a new file (of any type)
> - nfsd_link for adding a new link
Yes. Most operations in NFS are required to be synchronous (the only
exception being "unstable" write requests), and so those
fsync/write_inode_now calls are there in order to ensure that the
metadata and/or directory contents that were changed hits the disk
before the RPC call completes.
Cheers
Trond
On Thu, Feb 04, 2010 at 01:38:00PM -0500, Trond Myklebust wrote:
> > Indeed, both write_inode_now and vfs_fsync will also cause data to
> > be written. But my understanding of nfsd is that we manage the data
> > writeout separately anyway and we care about the metadata here, which
> > the placement of these calls would suggest:
> >
> > - nfsd_setattr for attribute updates
> > - nfsd_create for creating a new file (of any type)
> > - nfsd_link for adding a new link
>
> Yes. Most operations in NFS are required to be synchronous (the only
> exception being "unstable" write requests), and so those
> fsync/write_inode_now calls are there in order to ensure that the
> metadata and/or directory contents that were changed hits the disk
> before the RPC call completes.
Yeah. But currently both the fsync and write_inode_now calls will force
those unstable writes to disk. I'm not sure if that is an intentional
or unintentional side-effect of those metadata operations.
On Wed, Feb 03, 2010 at 05:44:24PM -0600, Ben Myers wrote:
> The following series is adds a 'wsync' export option to nfsd. It is intended
> to be used on XFS with the wsync mount option. When you already have a
> synchronous log there is no need to sync metadata separately.
You don't need the xfs wsync option, as the existing write_inode
calls or your new fsync calls are doing the same as the wsync mount
option, just from a higher layer. The wsync option causes the log
to be synchronously forced up to the log sequence number of the commit
for the metadata operation, that is make all the operations affected
by it synchronous. That's exactly what we'll do using fsync (actually
right now we force the whole log, but I have a patch to optimize it
to only force nuntil the last commit lsn), and approqimately the
same as we do using write_inode, just a lot less efficiently.
> This is barely tested, YMMV, I could have this all wrong, etc, etc. Here are
> some very unscientific measurements taken over gigabit ethernet.
>
> # time tar -xvf /mnt2/quilt-0.47.tar > /dev/null
>
> No XFS wsync, no NFS wsync:
> 0m13.177s 0m13.301s 0m13.528s
>
> XFS wsync set, no NFS wsync:
> 0m13.019s 0m13.232s 0m13.094s
>
> XFS wsync set, NFS wsync set:
> 0m8.361s 0m8.400s 0m8.301s
>
> Curious to hear if this is a reasonable thing to do. Suggestions welcome.
I think it's reasonable. What might be even better it to have an
export operation call out into the filesystem so that we can force wsync
and not let nfsd deal with it at all. There is a fair chance that the
filesystem can do the sync more efficiently.
And btw, not directly related to your patch, but getting rid of this
write_inode call defintively makes the usage of ->write_inode more
regular. From generic code we mostly use it for full inode writeback
in writeback_single_inode, and one other special case in
generic_detach_inode if a filesystem is unmounting. Only having
writeback_single_inode and fs code use it will make this call much
more predictable for the filesystem.
On Thu, 2010-02-04 at 13:40 -0500, Christoph Hellwig wrote:
> On Thu, Feb 04, 2010 at 01:38:00PM -0500, Trond Myklebust wrote:
> > > Indeed, both write_inode_now and vfs_fsync will also cause data to
> > > be written. But my understanding of nfsd is that we manage the data
> > > writeout separately anyway and we care about the metadata here, which
> > > the placement of these calls would suggest:
> > >
> > > - nfsd_setattr for attribute updates
> > > - nfsd_create for creating a new file (of any type)
> > > - nfsd_link for adding a new link
> >
> > Yes. Most operations in NFS are required to be synchronous (the only
> > exception being "unstable" write requests), and so those
> > fsync/write_inode_now calls are there in order to ensure that the
> > metadata and/or directory contents that were changed hits the disk
> > before the RPC call completes.
>
> Yeah. But currently both the fsync and write_inode_now calls will force
> those unstable writes to disk. I'm not sure if that is an intentional
> or unintentional side-effect of those metadata operations.
Syncing the unstable writes before the client sends a COMMIT request is
completely unnecessary. If we can avoid that, then that might indeed
give a performance boost for some workloads.
Cheers
Trond
On Thu, Feb 04, 2010 at 10:19:26AM -0500, Christoph Hellwig wrote:
> On Wed, Feb 03, 2010 at 05:44:44PM -0600, Ben Myers wrote:
> >
> > ---
> > fs/nfsd/vfs.c | 39 +++++++++++++++++++++++++--------------
> > 1 files changed, 25 insertions(+), 14 deletions(-)
> >
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index 89eb1b2..4b1973b 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -418,7 +418,7 @@ nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
> > if (!err) {
> > if (EX_ISSYNC(fhp->fh_export)) {
> > if (EX_ISWSYNC(fhp->fh_export)) {
> > - vfs_fsync(NULL, dentry, 0);
> > + vfs_fsync(NULL, dentry, 1);
>
> I think this is incorrect. datasync = 1 means to only write out
> metadata that is required to access the file data.
My thinking there was that the XFS wsync mount option would have ensured
that the metadata that changed is already in the log. I wonder if when
using the wsync xfs mount option, could one drop the vfs_fsync
altogether? Looks to me like O_SYNC would also have taken care of any
metadata that is required to access the file data.
> The write_inode code
> makes sure all metadata goes out, which AFAIK is what the NFS protocol
> requires from us anyway for all these metadata operations.
I think only the data/metadata that changed for the rpc in question
needs to end up on stable storage.
> Note that this does not invalidate your XFS numbers as right now XFS
> doesn't actually implement the datasync == 1 optimization, although I'm
> about to send a patch to do that. The updsize is that this will
> make your patchset quite a bit simpler :)
Cool. ;)
-Ben
On Wed, Feb 03, 2010 at 05:44:44PM -0600, Ben Myers wrote:
>
> ---
> fs/nfsd/vfs.c | 39 +++++++++++++++++++++++++--------------
> 1 files changed, 25 insertions(+), 14 deletions(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 89eb1b2..4b1973b 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -418,7 +418,7 @@ nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
> if (!err) {
> if (EX_ISSYNC(fhp->fh_export)) {
> if (EX_ISWSYNC(fhp->fh_export)) {
> - vfs_fsync(NULL, dentry, 0);
> + vfs_fsync(NULL, dentry, 1);
I think this is incorrect. datasync = 1 means to only write out
metadata that is required to access the file data. The write_inode code
makes sure all metadata goes out, which AFAIK is what the NFS protocol
requires from us anyway for all these metadata operations.
Note that this does not invalidate your XFS numbers as right now XFS
doesn't actually implement the datasync == 1 optimization, although I'm
about to send a patch to do that. The updsize is that this will
make your patchset quite a bit simpler :)
On Wed, Feb 03, 2010 at 06:58:31PM -0500, Trond Myklebust wrote:
> Why should the administrator have to both change /etc/fstab
> and /etc/exports? That will be an immediate source of trouble if someone
> changes one without changing the other.
I agree, this should be left to the filesystems.
> Why not rather add an optional operation to the export_ops to let the
> filesystem specify exactly what kind of synchronisation policy is
> optimal for it?
Maybe. But I suspect fsync is actually the right thing to do for
all the filesystems. The only reason why we'd still might want to make
it an export operation is that I'd really prefer not to grow more
instances of fsync without the file pointer. Then again they're
already all in nfs and it doesn't matter too much to audit a few
more places later.
On Wed, 2010-02-03 at 17:44 -0600, Ben Myers wrote:
> In April I did some research on why synchronous NFSv3 performance on XFS is so
> rotten when compared to local filesystem performance. The workload I chose to
> work with is tar.
>
> After taking some measurements I came to the conclusion that one of the big
> problems is that we're not treating the log as stable storage. By calling
> write_inode_now() we've written the changes to the log first and then gone and
> also written them out to the inode on disk.
>
> In a short discussion of this issue on the xfs-oss list it was suggested that I
> post the patches here for discussion.
>
> The following series is adds a 'wsync' export option to nfsd. It is intended
> to be used on XFS with the wsync mount option. When you already have a
> synchronous log there is no need to sync metadata separately.
Why should the administrator have to both change /etc/fstab
and /etc/exports? That will be an immediate source of trouble if someone
changes one without changing the other.
Why not rather add an optional operation to the export_ops to let the
filesystem specify exactly what kind of synchronisation policy is
optimal for it?
Trond
On Thu, Feb 04, 2010 at 11:20:16AM -0600, [email protected] wrote:
> > I think this is incorrect. datasync = 1 means to only write out
> > metadata that is required to access the file data.
>
> My thinking there was that the XFS wsync mount option would have ensured
> that the metadata that changed is already in the log. I wonder if when
> using the wsync xfs mount option, could one drop the vfs_fsync
> altogether? Looks to me like O_SYNC would also have taken care of any
> metadata that is required to access the file data.
Indeed, both write_inode_now and vfs_fsync will also cause data to
be written. But my understanding of nfsd is that we manage the data
writeout separately anyway and we care about the metadata here, which
the placement of these calls would suggest:
- nfsd_setattr for attribute updates
- nfsd_create for creating a new file (of any type)
- nfsd_link for adding a new link
interestingly we use nfsd_sync_dir in all those same places, just for
the parent directory. So if we want to stick to also writing data
out in the child (which most of them time probably won't be dirty
anyway) we really should make sure to use the same method for both,
be that ->fsync or a new export operation.
---
fs/nfsd/export.c | 1 +
include/linux/nfsd/export.h | 4 +++-
2 files changed, 4 insertions(+), 1 deletions(-)
diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
index c487810..170bf68 100644
--- a/fs/nfsd/export.c
+++ b/fs/nfsd/export.c
@@ -1428,6 +1428,7 @@ static struct flags {
{ NFSEXP_ROOTSQUASH, {"root_squash", "no_root_squash"}},
{ NFSEXP_ALLSQUASH, {"all_squash", ""}},
{ NFSEXP_ASYNC, {"async", "sync"}},
+ { NFSEXP_WSYNC, {"wsync", ""}},
{ NFSEXP_GATHERED_WRITES, {"wdelay", "no_wdelay"}},
{ NFSEXP_NOHIDE, {"nohide", ""}},
{ NFSEXP_CROSSMOUNT, {"crossmnt", ""}},
diff --git a/include/linux/nfsd/export.h b/include/linux/nfsd/export.h
index 8ae78a6..f6b7fc9 100644
--- a/include/linux/nfsd/export.h
+++ b/include/linux/nfsd/export.h
@@ -31,7 +31,8 @@
#define NFSEXP_ALLSQUASH 0x0008
#define NFSEXP_ASYNC 0x0010
#define NFSEXP_GATHERED_WRITES 0x0020
-/* 40 80 100 currently unused */
+#define NFSEXP_WSYNC 0x0040
+/* 80 100 currently unused */
#define NFSEXP_NOHIDE 0x0200
#define NFSEXP_NOSUBTREECHECK 0x0400
#define NFSEXP_NOAUTHNLM 0x0800 /* Don't authenticate NLM requests - just trust */
@@ -121,6 +122,7 @@ struct svc_expkey {
};
#define EX_ISSYNC(exp) (!((exp)->ex_flags & NFSEXP_ASYNC))
+#define EX_ISWSYNC(exp) (EX_ISSYNC(exp) && ((exp)->ex_flags & NFSEXP_WSYNC))
#define EX_NOHIDE(exp) ((exp)->ex_flags & NFSEXP_NOHIDE)
#define EX_WGATHER(exp) ((exp)->ex_flags & NFSEXP_GATHERED_WRITES)
---
fs/nfsd/nfs4recover.c | 2 +-
fs/nfsd/vfs.c | 18 +++++++++---------
fs/nfsd/vfs.h | 2 +-
3 files changed, 11 insertions(+), 11 deletions(-)
diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
index 5a754f7..e8c5938 100644
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -120,7 +120,7 @@ static void
nfsd4_sync_rec_dir(void)
{
mutex_lock(&rec_dir.dentry->d_inode->i_mutex);
- nfsd_sync_dir(rec_dir.dentry);
+ nfsd_sync_dir(rec_dir.dentry, 0);
mutex_unlock(&rec_dir.dentry->d_inode->i_mutex);
}
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 79d216f..1a64fb6 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -777,7 +777,7 @@ nfsd_close(struct file *filp)
* b) we expect to have i_mutex already held by the caller
*/
int
-nfsd_sync_dir(struct dentry *dentry)
+nfsd_sync_dir(struct dentry *dentry, int datasync)
{
struct inode *inode = dentry->d_inode;
int error;
@@ -786,7 +786,7 @@ nfsd_sync_dir(struct dentry *dentry)
error = filemap_write_and_wait(inode->i_mapping);
if (!error && inode->i_fop->fsync)
- error = inode->i_fop->fsync(NULL, dentry, 0);
+ error = inode->i_fop->fsync(NULL, dentry, datasync);
return error;
}
@@ -1322,7 +1322,7 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
}
if (EX_ISSYNC(fhp->fh_export)) {
- err = nfserrno(nfsd_sync_dir(dentry));
+ err = nfserrno(nfsd_sync_dir(dentry, 0));
write_inode_now(dchild->d_inode, 1);
}
@@ -1454,7 +1454,7 @@ nfsd_create_v3(struct svc_rqst *rqstp, struct svc_fh *fhp,
*created = 1;
if (EX_ISSYNC(fhp->fh_export)) {
- err = nfserrno(nfsd_sync_dir(dentry));
+ err = nfserrno(nfsd_sync_dir(dentry, 0));
/* setattr will sync the child (or not) */
}
@@ -1592,7 +1592,7 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (!host_err) {
if (EX_ISSYNC(fhp->fh_export))
- host_err = nfsd_sync_dir(dentry);
+ host_err = nfsd_sync_dir(dentry, 0);
}
err = nfserrno(host_err);
fh_unlock(fhp);
@@ -1657,7 +1657,7 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
host_err = vfs_link(dold, dirp, dnew);
if (!host_err) {
if (EX_ISSYNC(ffhp->fh_export)) {
- err = nfserrno(nfsd_sync_dir(ddir));
+ err = nfserrno(nfsd_sync_dir(ddir, 0));
write_inode_now(dest, 1);
}
err = 0;
@@ -1757,9 +1757,9 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
host_err = vfs_rename(fdir, odentry, tdir, ndentry);
if (!host_err && EX_ISSYNC(tfhp->fh_export)) {
- host_err = nfsd_sync_dir(tdentry);
+ host_err = nfsd_sync_dir(tdentry, 0);
if (!host_err)
- host_err = nfsd_sync_dir(fdentry);
+ host_err = nfsd_sync_dir(fdentry, 0);
}
mnt_drop_write(ffhp->fh_export->ex_path.mnt);
@@ -1843,7 +1843,7 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
if (host_err)
goto out_drop;
if (EX_ISSYNC(fhp->fh_export))
- host_err = nfsd_sync_dir(dentry);
+ host_err = nfsd_sync_dir(dentry, 0);
out_drop:
mnt_drop_write(fhp->fh_export->ex_path.mnt);
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index 4b1de0a..89154a5 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -91,7 +91,7 @@ __be32 nfsd_statfs(struct svc_rqst *, struct svc_fh *,
int nfsd_notify_change(struct inode *, struct iattr *);
__be32 nfsd_permission(struct svc_rqst *, struct svc_export *,
struct dentry *, int);
-int nfsd_sync_dir(struct dentry *dp);
+int nfsd_sync_dir(struct dentry *dp, int datasync);
#if defined(CONFIG_NFSD_V2_ACL) || defined(CONFIG_NFSD_V3_ACL)
struct posix_acl *nfsd_get_posix_acl(struct svc_fh *, int);
---
fs/nfsd/vfs.c | 25 ++++++++++++++++++++-----
1 files changed, 20 insertions(+), 5 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 1a64fb6..89eb1b2 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -415,9 +415,16 @@ nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
}
if (size_change)
put_write_access(inode);
- if (!err)
- if (EX_ISSYNC(fhp->fh_export))
- write_inode_now(inode, 1);
+ if (!err) {
+ if (EX_ISSYNC(fhp->fh_export)) {
+ if (EX_ISWSYNC(fhp->fh_export)) {
+ vfs_fsync(NULL, dentry, 0);
+ } else {
+ write_inode_now(inode, 1);
+ }
+ }
+ }
+
out:
return err;
@@ -1323,7 +1330,11 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (EX_ISSYNC(fhp->fh_export)) {
err = nfserrno(nfsd_sync_dir(dentry, 0));
- write_inode_now(dchild->d_inode, 1);
+ if (EX_ISWSYNC(fhp->fh_export)) {
+ vfs_fsync(NULL, dchild, 0);
+ } else {
+ write_inode_now(dchild->d_inode, 1);
+ }
}
err2 = nfsd_create_setattr(rqstp, resfhp, iap);
@@ -1658,7 +1669,11 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
if (!host_err) {
if (EX_ISSYNC(ffhp->fh_export)) {
err = nfserrno(nfsd_sync_dir(ddir, 0));
- write_inode_now(dest, 1);
+ if (EX_ISWSYNC(ffhp->fh_export)) {
+ vfs_fsync(NULL, dold, 0);
+ } else {
+ write_inode_now(dest, 1);
+ }
}
err = 0;
} else {
---
fs/nfsd/vfs.c | 39 +++++++++++++++++++++++++--------------
1 files changed, 25 insertions(+), 14 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 89eb1b2..4b1973b 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -418,7 +418,7 @@ nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
if (!err) {
if (EX_ISSYNC(fhp->fh_export)) {
if (EX_ISWSYNC(fhp->fh_export)) {
- vfs_fsync(NULL, dentry, 0);
+ vfs_fsync(NULL, dentry, 1);
} else {
write_inode_now(inode, 1);
}
@@ -1169,7 +1169,8 @@ nfsd_commit(struct svc_rqst *rqstp, struct svc_fh *fhp,
return err;
if (EX_ISSYNC(fhp->fh_export)) {
if (file->f_op && file->f_op->fsync) {
- err = nfserrno(vfs_fsync(file, file->f_path.dentry, 0));
+ err = nfserrno(vfs_fsync(file, file->f_path.dentry,
+ EX_ISWSYNC(fhp->fh_export) ? 1 : 0));
} else {
err = nfserr_notsupp;
}
@@ -1329,9 +1330,10 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
}
if (EX_ISSYNC(fhp->fh_export)) {
- err = nfserrno(nfsd_sync_dir(dentry, 0));
+ err = nfserrno(nfsd_sync_dir(dentry,
+ EX_ISWSYNC(fhp->fh_export) ? 1 : 0));
if (EX_ISWSYNC(fhp->fh_export)) {
- vfs_fsync(NULL, dchild, 0);
+ vfs_fsync(NULL, dchild, 1);
} else {
write_inode_now(dchild->d_inode, 1);
}
@@ -1465,7 +1467,8 @@ nfsd_create_v3(struct svc_rqst *rqstp, struct svc_fh *fhp,
*created = 1;
if (EX_ISSYNC(fhp->fh_export)) {
- err = nfserrno(nfsd_sync_dir(dentry, 0));
+ err = nfserrno(nfsd_sync_dir(dentry,
+ EX_ISWSYNC(fhp->fh_export) ? 1 : 0));
/* setattr will sync the child (or not) */
}
@@ -1602,8 +1605,10 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
host_err = vfs_symlink(dentry->d_inode, dnew, path);
if (!host_err) {
- if (EX_ISSYNC(fhp->fh_export))
- host_err = nfsd_sync_dir(dentry, 0);
+ if (EX_ISSYNC(fhp->fh_export)) {
+ host_err = nfsd_sync_dir(dentry,
+ EX_ISWSYNC(fhp->fh_export) ? 1 : 0);
+ }
}
err = nfserrno(host_err);
fh_unlock(fhp);
@@ -1668,9 +1673,10 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
host_err = vfs_link(dold, dirp, dnew);
if (!host_err) {
if (EX_ISSYNC(ffhp->fh_export)) {
- err = nfserrno(nfsd_sync_dir(ddir, 0));
+ err = nfserrno(nfsd_sync_dir(ddir,
+ EX_ISWSYNC(ffhp->fh_export) ? 1 : 0));
if (EX_ISWSYNC(ffhp->fh_export)) {
- vfs_fsync(NULL, dold, 0);
+ vfs_fsync(NULL, dold, 1);
} else {
write_inode_now(dest, 1);
}
@@ -1772,9 +1778,12 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
host_err = vfs_rename(fdir, odentry, tdir, ndentry);
if (!host_err && EX_ISSYNC(tfhp->fh_export)) {
- host_err = nfsd_sync_dir(tdentry, 0);
- if (!host_err)
- host_err = nfsd_sync_dir(fdentry, 0);
+ host_err = nfsd_sync_dir(tdentry,
+ EX_ISWSYNC(tfhp->fh_export) ? 1 : 0);
+ if (!host_err) {
+ host_err = nfsd_sync_dir(fdentry,
+ EX_ISWSYNC(tfhp->fh_export) ? 1 : 0);
+ }
}
mnt_drop_write(ffhp->fh_export->ex_path.mnt);
@@ -1857,8 +1866,10 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
if (host_err)
goto out_drop;
- if (EX_ISSYNC(fhp->fh_export))
- host_err = nfsd_sync_dir(dentry, 0);
+ if (EX_ISSYNC(fhp->fh_export)) {
+ host_err = nfsd_sync_dir(dentry,
+ EX_ISWSYNC(fhp->fh_export) ? 1 : 0);
+ }
out_drop:
mnt_drop_write(fhp->fh_export->ex_path.mnt);