LinuxLists.cc - [RFC PATCH 0/2] nfsd sync export

2010-02-10 00:33:28

Subject: [RFC PATCH 0/2] nfsd sync export_op (was 'wsync export option')

This is another try based upon Christoph and Trond's suggestions in the 'wsync
export option' thread. I've added the export operation and combined
nfsd_sync_dir and write_inode_now usage into nfsd_sync2 which passes the buck
onto XFS. It's been very lightly tested. It's just a little bit faster than
the previous try:

# time tar -xvf /mnt2/quilt-0.47.tar > /dev/null

plain jane:
0m13.177s 0m13.301s 0m13.528s

previous try:
0m8.361s 0m8.400s 0m8.301s

w/ xfs commit_metadata op:
0m7.426s 0m7.340s 0m7.198s

Thanks!
-Ben

---

Ben Myers (2):
commit_metadata export operation and nfsd_sync2
xfs_export_operations.commit_metadata

fs/nfsd/nfs3proc.c | 2 -
fs/nfsd/nfs4proc.c | 2 -
fs/nfsd/nfs4recover.c | 2 -
fs/nfsd/nfs4state.c | 2 -
fs/nfsd/nfsproc.c | 4 +
fs/nfsd/vfs.c | 113 ++++++++++++++++++++++++++---------------
fs/nfsd/vfs.h | 4 +
fs/xfs/linux-2.6/xfs_export.c | 64 +++++++++++++++++++++++
include/linux/exportfs.h | 6 ++
9 files changed, 151 insertions(+), 48 deletions(-)

--

2010-02-10 00:33:33

by Ben Myers

[permalink] [raw]

Subject: [RFC PATCH 1/2] commit_metadata export operation and nfsd_sync2

- Add an new export_operation 'commit_metadata' used to commit two inodes to
stable storage.

- Combine usage of nfsd_sync_dir and write_inode_now into nfsd_sync2 taking
parent and child dentries to be committed.

- Add an arg to nfsd_setattr so that we can delay commiting changes in
situations where it might be beneficial for the caller to do so instead.
---
fs/nfsd/nfs3proc.c | 2 -
fs/nfsd/nfs4proc.c | 2 -
fs/nfsd/nfs4recover.c | 2 -
fs/nfsd/nfs4state.c | 2 -
fs/nfsd/nfsproc.c | 4 +-
fs/nfsd/vfs.c | 113 ++++++++++++++++++++++++++++++----------------
fs/nfsd/vfs.h | 4 +-
include/linux/exportfs.h | 6 ++
8 files changed, 87 insertions(+), 48 deletions(-)

diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
index 3d68f45..fe3af23 100644
--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -76,7 +76,7 @@ nfsd3_proc_setattr(struct svc_rqst *rqstp, struct nfsd3_sattrargs *argp,

fh_copy(&resp->fh, &argp->fh);
nfserr = nfsd_setattr(rqstp, &resp->fh, &argp->attrs,
- argp->check_guard, argp->guardtime);
+ argp->check_guard, argp->guardtime, 0);
RETURN_STATUS(nfserr);
}

diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 37514c4..b6baf30 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -803,7 +803,7 @@ nfsd4_setattr(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
if (status)
goto out;
status = nfsd_setattr(rqstp, &cstate->current_fh, &setattr->sa_iattr,
- 0, (time_t)0);
+ 0, (time_t)0, 0);
out:
mnt_drop_write(cstate->current_fh.fh_export->ex_path.mnt);
return status;
diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
index 5a754f7..4c8e1d8 100644
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -120,7 +120,7 @@ static void
nfsd4_sync_rec_dir(void)
{
mutex_lock(&rec_dir.dentry->d_inode->i_mutex);
- nfsd_sync_dir(rec_dir.dentry);
+ nfsd_sync2(rec_dir.dentry, NULL);
mutex_unlock(&rec_dir.dentry->d_inode->i_mutex);
}

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 3a20c09..e4490b5 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -2290,7 +2290,7 @@ nfsd4_truncate(struct svc_rqst *rqstp, struct svc_fh *fh,
return 0;
if (!(open->op_share_access & NFS4_SHARE_ACCESS_WRITE))
return nfserr_inval;
- return nfsd_setattr(rqstp, fh, &iattr, 0, (time_t)0);
+ return nfsd_setattr(rqstp, fh, &iattr, 0, (time_t)0, 0);
}

static __be32
diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index a047ad6..3ea44a3 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -69,7 +69,7 @@ nfsd_proc_setattr(struct svc_rqst *rqstp, struct nfsd_sattrargs *argp,
argp->attrs.ia_valid, (long) argp->attrs.ia_size);

fh_copy(&resp->fh, &argp->fh);
- nfserr = nfsd_setattr(rqstp, &resp->fh, &argp->attrs,0, (time_t)0);
+ nfserr = nfsd_setattr(rqstp, &resp->fh, &argp->attrs,0, (time_t)0, 0);
return nfsd_return_attrs(nfserr, resp);
}

@@ -326,7 +326,7 @@ nfsd_proc_create(struct svc_rqst *rqstp, struct nfsd_createargs *argp,
*/
attr->ia_valid &= ATTR_SIZE;
if (attr->ia_valid)
- nfserr = nfsd_setattr(rqstp, newfhp, attr, 0, (time_t)0);
+ nfserr = nfsd_setattr(rqstp, newfhp, attr, 0, (time_t)0, 0);
}

out_unlock:
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 79d216f..7ee928b 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -27,6 +27,7 @@
#include <linux/jhash.h>
#include <linux/ima.h>
#include <asm/uaccess.h>
+#include <linux/exportfs.h>

#ifdef CONFIG_NFSD_V3
#include "xdr3.h"
@@ -278,7 +279,7 @@ out:
*/
__be32
nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
- int check_guard, time_t guardtime)
+ int check_guard, time_t guardtime, int delay_sync)
{
struct dentry *dentry;
struct inode *inode;
@@ -415,9 +416,10 @@ nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
}
if (size_change)
put_write_access(inode);
- if (!err)
- if (EX_ISSYNC(fhp->fh_export))
- write_inode_now(inode, 1);
+ if (!err) {
+ if (EX_ISSYNC(fhp->fh_export) && !delay_sync)
+ err = nfsd_sync2(NULL, dentry);
+ }
out:
return err;

@@ -769,24 +771,54 @@ nfsd_close(struct file *filp)
}

/*
- * Sync a directory to disk.
- *
- * We can't just call vfs_fsync because our requirements are slightly odd:
- *
- * a) we do not have a file struct available
- * b) we expect to have i_mutex already held by the caller
+ * Commit parent and child to stable storage. You may pass NULL for parent or
+ * child. This expects i_mutex to be held on the parent and not to be held on
+ * the child.
*/
int
-nfsd_sync_dir(struct dentry *dentry)
+nfsd_sync2(struct dentry *parent, struct dentry *child)
{
- struct inode *inode = dentry->d_inode;
- int error;
+ const struct export_operations *export_ops = NULL;
+ struct inode *p_inode = NULL, *c_inode = NULL;
+ int error = 0, error2 = 0;
+
+ if (parent) {
+ p_inode = parent->d_inode;
+ WARN_ON(!mutex_is_locked(&p_inode->i_mutex));
+ export_ops = parent->d_sb->s_export_op;
+ }
+ if (child) {
+ c_inode = child->d_inode;
+ WARN_ON(mutex_is_locked(&c_inode->i_mutex));
+ export_ops = child->d_sb->s_export_op;
+ }

- WARN_ON(!mutex_is_locked(&inode->i_mutex));
+ if (export_ops->commit_metadata) {
+ if (parent)
+ error = filemap_write_and_wait(p_inode->i_mapping);
+ if (child)
+ error2 = filemap_write_and_wait(c_inode->i_mapping);
+
+ if (error2)
+ error = error2;
+
+ if (!error) {
+ if (child)
+ mutex_lock(&c_inode->i_mutex);
+ error = export_ops->commit_metadata(parent, child);
+ if (child)
+ mutex_unlock(&c_inode->i_mutex);
+ }
+ } else {
+ if (parent) {
+ error = filemap_write_and_wait(p_inode->i_mapping);
+ if (!error && p_inode->i_fop->fsync)
+ error = p_inode->i_fop->fsync(NULL, parent, 0);
+ }
+ if (child)
+ write_inode_now(c_inode, 1);
+ }

- error = filemap_write_and_wait(inode->i_mapping);
- if (!error && inode->i_fop->fsync)
- error = inode->i_fop->fsync(NULL, dentry, 0);
return error;
}

@@ -1188,8 +1220,10 @@ nfsd_create_setattr(struct svc_rqst *rqstp, struct svc_fh *resfhp,
*/
if (current_fsuid() != 0)
iap->ia_valid &= ~(ATTR_UID|ATTR_GID);
- if (iap->ia_valid)
- return nfsd_setattr(rqstp, resfhp, iap, 0, (time_t)0);
+ if (iap->ia_valid) {
+ return nfsd_setattr(rqstp, resfhp, iap, 0, (time_t)0,
+ 1 /* delay the sync. our caller does it. */);
+ }
return 0;
}

@@ -1321,14 +1355,14 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
goto out_nfserr;
}

+ err = nfsd_create_setattr(rqstp, resfhp, iap);
if (EX_ISSYNC(fhp->fh_export)) {
- err = nfserrno(nfsd_sync_dir(dentry));
- write_inode_now(dchild->d_inode, 1);
+ err2 = nfserrno(nfsd_sync2(dentry,
+ IS_ERR(dchild) ? NULL : dchild));
+ if (err2)
+ err = err2;
}

- err2 = nfsd_create_setattr(rqstp, resfhp, iap);
- if (err2)
- err = err2;
mnt_drop_write(fhp->fh_export->ex_path.mnt);
/*
* Update the file handle to get the new inode info.
@@ -1358,7 +1392,6 @@ nfsd_create_v3(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct dentry *dentry, *dchild = NULL;
struct inode *dirp;
__be32 err;
- __be32 err2;
int host_err;
__u32 v_mtime=0, v_atime=0;

@@ -1453,11 +1486,6 @@ nfsd_create_v3(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (created)
*created = 1;

- if (EX_ISSYNC(fhp->fh_export)) {
- err = nfserrno(nfsd_sync_dir(dentry));
- /* setattr will sync the child (or not) */
- }
-
nfsd_check_ignore_resizing(iap);

if (createmode == NFS3_CREATE_EXCLUSIVE) {
@@ -1472,9 +1500,7 @@ nfsd_create_v3(struct svc_rqst *rqstp, struct svc_fh *fhp,
}

set_attr:
- err2 = nfsd_create_setattr(rqstp, resfhp, iap);
- if (err2)
- err = err2;
+ err = nfsd_create_setattr(rqstp, resfhp, iap);

mnt_drop_write(fhp->fh_export->ex_path.mnt);
/*
@@ -1484,6 +1510,14 @@ nfsd_create_v3(struct svc_rqst *rqstp, struct svc_fh *fhp,
err = fh_update(resfhp);

out:
+ if (!err) {
+ if (EX_ISSYNC(fhp->fh_export)) {
+ host_err = nfsd_sync2(created ? dentry : NULL,
+ !IS_ERR(dchild) ? dchild : NULL);
+ err = nfserrno(host_err);
+ }
+ }
+
fh_unlock(fhp);
if (dchild && !IS_ERR(dchild))
dput(dchild);
@@ -1592,7 +1626,7 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,

if (!host_err) {
if (EX_ISSYNC(fhp->fh_export))
- host_err = nfsd_sync_dir(dentry);
+ host_err = nfsd_sync2(dentry, NULL);
}
err = nfserrno(host_err);
fh_unlock(fhp);
@@ -1656,11 +1690,10 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
}
host_err = vfs_link(dold, dirp, dnew);
if (!host_err) {
+ err = 0;
if (EX_ISSYNC(ffhp->fh_export)) {
- err = nfserrno(nfsd_sync_dir(ddir));
- write_inode_now(dest, 1);
+ err = nfserrno(nfsd_sync2(ddir, dold));
}
- err = 0;
} else {
if (host_err == -EXDEV && rqstp->rq_vers == 2)
err = nfserr_acces;
@@ -1757,9 +1790,9 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,

host_err = vfs_rename(fdir, odentry, tdir, ndentry);
if (!host_err && EX_ISSYNC(tfhp->fh_export)) {
- host_err = nfsd_sync_dir(tdentry);
+ host_err = nfsd_sync2(tdentry, NULL);
if (!host_err)
- host_err = nfsd_sync_dir(fdentry);
+ host_err = nfsd_sync2(fdentry, NULL);
}

mnt_drop_write(ffhp->fh_export->ex_path.mnt);
@@ -1843,7 +1876,7 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
if (host_err)
goto out_drop;
if (EX_ISSYNC(fhp->fh_export))
- host_err = nfsd_sync_dir(dentry);
+ host_err = nfsd_sync2(dentry, NULL);

out_drop:
mnt_drop_write(fhp->fh_export->ex_path.mnt);
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index 4b1de0a..3a439da 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -41,7 +41,7 @@ __be32 nfsd_lookup_dentry(struct svc_rqst *, struct svc_fh *,
const char *, unsigned int,
struct svc_export **, struct dentry **);
__be32 nfsd_setattr(struct svc_rqst *, struct svc_fh *,
- struct iattr *, int, time_t);
+ struct iattr *, int, time_t, int);
int nfsd_mountpoint(struct dentry *, struct svc_export *);
#ifdef CONFIG_NFSD_V4
__be32 nfsd4_set_nfs4_acl(struct svc_rqst *, struct svc_fh *,
@@ -91,7 +91,7 @@ __be32 nfsd_statfs(struct svc_rqst *, struct svc_fh *,
int nfsd_notify_change(struct inode *, struct iattr *);
__be32 nfsd_permission(struct svc_rqst *, struct svc_export *,
struct dentry *, int);
-int nfsd_sync_dir(struct dentry *dp);
+int nfsd_sync2(struct dentry *parent, struct dentry *child);

#if defined(CONFIG_NFSD_V2_ACL) || defined(CONFIG_NFSD_V3_ACL)
struct posix_acl *nfsd_get_posix_acl(struct svc_fh *, int);
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index dc12f41..d30dbe1 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -96,6 +96,7 @@ struct fid {
* @fh_to_parent: find the implied object's parent and get a dentry for it
* @get_name: find the name for a given inode in a given directory
* @get_parent: find the parent of a given directory
+ * @commit_metadata: commit metadata changes to stable storage
*
* See Documentation/filesystems/nfs/Exporting for details on how to use
* this interface correctly.
@@ -137,6 +138,10 @@ struct fid {
* is also a directory. In the event that it cannot be found, or storage
* space cannot be allocated, a %ERR_PTR should be returned.
*
+ * commit_metadata:
+ * @commit_metadata should commit metadata changes to stable storage.
+ * Parent or child can be NULL.
+ *
* Locking rules:
* get_parent is called with child->d_inode->i_mutex down
* get_name is not (which is possibly inconsistent)
@@ -152,6 +157,7 @@ struct export_operations {
int (*get_name)(struct dentry *parent, char *name,
struct dentry *child);
struct dentry * (*get_parent)(struct dentry *child);
+ int (*commit_metadata)(struct dentry *parent, struct dentry *child);
};

extern int exportfs_encode_fh(struct dentry *dentry, struct fid *fid,

2010-02-10 00:33:38

by Ben Myers

[permalink] [raw]

Subject: [RFC PATCH 2/2] xfs_export_operations.commit_metadata

Here is the commit_metadata export_operation for xfs. We take two dentries and
force the log up to the larger lsn. It looks to me that in nfsd the child is
always modified after the parent so generally we expect the child's lsn to be
larger. If that's not the case we'll just force the entire thing.

The basic form of this is based upon one of Christoph's suggestions. I'm an
xfs newbie so I'm not very comfortable with it yet. My understanding is that I
need to verify that all of the necessary changes make it into the transations
we're forcing into the log here. I am still looking into that and hopefully
the XFS gurus can continue to provide guidance.
---
fs/xfs/linux-2.6/xfs_export.c | 64 +++++++++++++++++++++++++++++++++++++++++
1 files changed, 64 insertions(+), 0 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_export.c b/fs/xfs/linux-2.6/xfs_export.c
index 87b8cbd..af4a214 100644
--- a/fs/xfs/linux-2.6/xfs_export.c
+++ b/fs/xfs/linux-2.6/xfs_export.c
@@ -29,6 +29,7 @@
#include "xfs_vnodeops.h"
#include "xfs_bmap_btree.h"
#include "xfs_inode.h"
+#include "xfs_inode_item.h"

/*
* Note that we only accept fileids which are long enough rather than allow
@@ -215,9 +216,72 @@ xfs_fs_get_parent(
return d_obtain_alias(VFS_I(cip));
}

+STATIC int
+xfs_fs_nfs_commit_metadata(
+ struct dentry *parent,
+ struct dentry *child)
+{
+ struct xfs_inode *p_xip = NULL, *c_xip = NULL;
+ struct xfs_mount *i_mount = NULL;
+ xfs_lsn_t force_lsn = 0;
+ int error = 0;
+
+ if (parent && !child) {
+ p_xip = XFS_I(parent->d_inode);
+ xfs_ilock(p_xip, XFS_ILOCK_SHARED);
+ if (xfs_ipincount(p_xip)) {
+ force_lsn = p_xip->i_itemp->ili_last_lsn;
+ i_mount = p_xip->i_mount;
+ }
+ } else if (child && !parent) {
+ c_xip = XFS_I(child->d_inode);
+ xfs_ilock(c_xip, XFS_ILOCK_SHARED);
+ if (xfs_ipincount(c_xip)) {
+ force_lsn = c_xip->i_itemp->ili_last_lsn;
+ i_mount = c_xip->i_mount;
+ }
+ } else if (parent && child) {
+ p_xip = XFS_I(parent->d_inode);
+ c_xip = XFS_I(child->d_inode);
+ xfs_ilock(p_xip, XFS_ILOCK_SHARED);
+ xfs_ilock(c_xip, XFS_ILOCK_SHARED);
+ if (xfs_ipincount(p_xip)) {
+ force_lsn = p_xip->i_itemp->ili_last_lsn;
+ i_mount = p_xip->i_mount;
+ }
+ if (xfs_ipincount(c_xip)) {
+ /*
+ * AFAICS the child is always modified after the parent
+ * in nfsd so should always have a larger lsn.
+ */
+ if (c_xip->i_itemp->ili_last_lsn > force_lsn) {
+ force_lsn = c_xip->i_itemp->ili_last_lsn;
+ } else {
+ force_lsn = 0; /* whole thing */
+ }
+ i_mount = c_xip->i_mount;
+ }
+ } else {
+ return -EINVAL;
+ }
+
+ if (i_mount) {
+ error = _xfs_log_force(i_mount, force_lsn,
+ XFS_LOG_FORCE | XFS_LOG_SYNC, NULL);
+ }
+
+ if (child)
+ xfs_iunlock(c_xip, XFS_ILOCK_SHARED);
+ if (parent)
+ xfs_iunlock(p_xip, XFS_ILOCK_SHARED);
+
+ return error;
+}
+
const struct export_operations xfs_export_operations = {
.encode_fh = xfs_fs_encode_fh,
.fh_to_dentry = xfs_fs_fh_to_dentry,
.fh_to_parent = xfs_fs_fh_to_parent,
.get_parent = xfs_fs_get_parent,
+ .commit_metadata = xfs_fs_nfs_commit_metadata,
};

2010-02-10 08:56:15

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] commit_metadata export operation and nfsd_sync2

Thanks, this looks very good! A few comments below:

> status = nfsd_setattr(rqstp, &cstate->current_fh, &setattr->sa_iattr,
> - 0, (time_t)0);
> + 0, (time_t)0, 0);

While you're at it you might want to remove all those superflous time_t
cases - the C propagation rules take care of it when just passing a "0".

> diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
> index 5a754f7..4c8e1d8 100644
> --- a/fs/nfsd/nfs4recover.c
> +++ b/fs/nfsd/nfs4recover.c
> @@ -120,7 +120,7 @@ static void
> nfsd4_sync_rec_dir(void)
> {
> mutex_lock(&rec_dir.dentry->d_inode->i_mutex);
> - nfsd_sync_dir(rec_dir.dentry);
> + nfsd_sync2(rec_dir.dentry, NULL);
> mutex_unlock(&rec_dir.dentry->d_inode->i_mutex);

This call should be changed to just use vfs_fsync as it doesn't
need the special semantics with the i_mutex already held - we take
it just for nfsd_sync_dir here. That also has the advantage that
nfsd_sync_dir or its replacement can be private to vfs.c now.

> /*
> + * Commit parent and child to stable storage. You may pass NULL for parent or
> + * child. This expects i_mutex to be held on the parent and not to be held on
> + * the child.
> */
> int
> +nfsd_sync2(struct dentry *parent, struct dentry *child)

I think both the local routine and the method should be the the same.
The commit_metadata seems a quite a bit better than sync2 for that.

I also think the EX_ISSYNC check should be taken into it instead
of beeing duplicated, which would require it to take a file handle argument.

> + const struct export_operations *export_ops = NULL;
> + struct inode *p_inode = NULL, *c_inode = NULL;

Normally we'd just use parent and child for this, or posisbly
just inode and child.

> + if (parent) {
> + p_inode = parent->d_inode;
> + WARN_ON(!mutex_is_locked(&p_inode->i_mutex));
> + export_ops = parent->d_sb->s_export_op;
> + }
> + if (child) {
> + c_inode = child->d_inode;
> + WARN_ON(mutex_is_locked(&c_inode->i_mutex));
> + export_ops = child->d_sb->s_export_op;
> + }

A better calling convention would be for the first paramter to
always be non-zero (and we could take the file handle for that one),
with only the second argument beeing optional. That would require
using the same method for the fallback sync, which we should do anyway.

Currently the only caller passing a NULL first argument is
nfsd_setattr. If we use ->write_inode as fallback we could just
pass it as first, if using ->fsync we'd need to take i_mutex before,
but we might just stick to using ->write_inode to keep things
simple (and get rid of the NULL file pointer in ->fsync).

> + if (export_ops->commit_metadata) {
> + if (parent)
> + error = filemap_write_and_wait(p_inode->i_mapping);
> + if (child)
> + error2 = filemap_write_and_wait(c_inode->i_mapping);

I think Trond explained that we do not want force data to disk here.

> + if (!error) {
> + if (child)
> + mutex_lock(&c_inode->i_mutex);
> + error = export_ops->commit_metadata(parent, child);
> + if (child)
> + mutex_unlock(&c_inode->i_mutex);

Note that at least xfs doesn't care about the i_mutex here.

> + if (parent) {
> + error = filemap_write_and_wait(p_inode->i_mapping);
> + if (!error && p_inode->i_fop->fsync)
> + error = p_inode->i_fop->fsync(NULL, parent, 0);
> + }
> + if (child)
> + write_inode_now(c_inode, 1);
> + }

Btw, as we don't need to write data to disk this should be a
sync_inode calls with the following second argument:

struct writeback_control wbc = {
.sync_mode = WB_SYNC_ALL,
.nr_to_write = 0, /* metadata-only */
};

>
> @@ -1321,14 +1355,14 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
> goto out_nfserr;
> }
>
> + err = nfsd_create_setattr(rqstp, resfhp, iap);
> if (EX_ISSYNC(fhp->fh_export)) {
> - err = nfserrno(nfsd_sync_dir(dentry));
> - write_inode_now(dchild->d_inode, 1);
> + err2 = nfserrno(nfsd_sync2(dentry,
> + IS_ERR(dchild) ? NULL : dchild));

I can't see how you could ever get a dchild that is an error pointer
here.

> @@ -1484,6 +1510,14 @@ nfsd_create_v3(struct svc_rqst *rqstp, struct svc_fh *fhp,
> err = fh_update(resfhp);
>
> out:
> + if (!err) {
> + if (EX_ISSYNC(fhp->fh_export)) {
> + host_err = nfsd_sync2(created ? dentry : NULL,
> + !IS_ERR(dchild) ? dchild : NULL);
> + err = nfserrno(host_err);
> + }
> + }
> +

The placement of this call looks incorrect to me. We don't want
this after the out label which is used for errors, but directly after
the nfsd_create_setattr call, and most importantly before the
mnt_drop_write after which we're not allowed to the filesystem anymore.
That also makes sure we'll never get dchild or dentry as an error
pointer.

2010-02-10 09:07:51

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [RFC PATCH 2/2] xfs_export_operations.commit_metadata

On Tue, Feb 09, 2010 at 06:33:37PM -0600, Ben Myers wrote:
> Here is the commit_metadata export_operation for xfs. We take two dentries and
> force the log up to the larger lsn. It looks to me that in nfsd the child is
> always modified after the parent so generally we expect the child's lsn to be
> larger. If that's not the case we'll just force the entire thing.
>
> The basic form of this is based upon one of Christoph's suggestions. I'm an
> xfs newbie so I'm not very comfortable with it yet. My understanding is that I
> need to verify that all of the necessary changes make it into the transations
> we're forcing into the log here. I am still looking into that and hopefully
> the XFS gurus can continue to provide guidance.

Ccing the xfs list would help with that :) Anyway, I think it looks
pretty good, but there's quite a few smaller nitpicks:

> +STATIC int
> +xfs_fs_nfs_commit_metadata(
> + struct dentry *parent,
> + struct dentry *child)
> +{
> + struct xfs_inode *p_xip = NULL, *c_xip = NULL;

Normal xfs naming would be dp for the parent, and ip for the child,
it would be good to stick to that.

> + struct xfs_mount *i_mount = NULL;

Normal name all over xfs would be mp.

> + } else if (parent && child) {
> + p_xip = XFS_I(parent->d_inode);
> + c_xip = XFS_I(child->d_inode);
> + xfs_ilock(p_xip, XFS_ILOCK_SHARED);
> + xfs_ilock(c_xip, XFS_ILOCK_SHARED);

If we need to lock both parent and child we need to use
xfs_lock_two_inodes to make sure the lock order is correct.

> + if (xfs_ipincount(c_xip)) {
> + /*
> + * AFAICS the child is always modified after the parent
> + * in nfsd so should always have a larger lsn.
> + */
> + if (c_xip->i_itemp->ili_last_lsn > force_lsn) {
> + force_lsn = c_xip->i_itemp->ili_last_lsn;
> + } else {
> + force_lsn = 0; /* whole thing */
> + }

I wouldn't rely on that and always take the larger one.

Now with the simplification of always having a non-zero first argument
suggested in the previous mail this might be simplified down to:

STATIC int
xfs_fs_nfs_commit_metadata(
struct dentry *parent,
struct dentry *child)
{
struct xfs_inode *dp = XFS_I(parent->d_inode);
struct xfs_inode *ip = NULL;
struct xfs_mount *mp = dp->i_mount;
xfs_lsn_t force_lsn = 0;
int error = 0;

if (child) {
ip = XFS_I(child->d_inode);
xfs_lock_two_inodes(dp, ip, XFS_ILOCK_SHARED);
} else {
xfs_ilock(dp, XFS_ILOCK_SHARED);
}

if (xfs_ipincount(dp))
force_lsn = dp->i_itemp->ili_last_lsn;
if (ip && xfs_ipincount(ip))
force_lsn = max(force_lsn, ip->i_itemp->ili_last_lsn);

error = _xfs_log_force_lsn(mp, force_lsn, NULL);

if (ip)
xfs_iunlock(ip, XFS_ILOCK_SHARED);
xfs_iunlock(dp, XFS_ILOCK_SHARED);

return error;
}

Note that _xfs_log_force_lsn is new in the XFS tree, mainline still
has _xfs_log_force with an lsn argument.

Also note that ->commit_metadata probably should take just two inodes
instead of two dentires given the level it operates on, but it shouldn't
matter too much.

2010-02-10 10:11:31

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [RFC PATCH 2/2] xfs_export_operations.commit_metadata

On Wed, Feb 10, 2010 at 04:07:50AM -0500, Christoph Hellwig wrote:
> > + /*
> > + * AFAICS the child is always modified after the parent
> > + * in nfsd so should always have a larger lsn.
> > + */
> > + if (c_xip->i_itemp->ili_last_lsn > force_lsn) {
> > + force_lsn = c_xip->i_itemp->ili_last_lsn;
> > + } else {
> > + force_lsn = 0; /* whole thing */
> > + }
>
> I wouldn't rely on that and always take the larger one.

Or we could use that fact for making the prototype saner:

- the commit_metadata only takes a single inode to force out
- we make sure to always call in on the child first. For any
log based filesystem that will force the parent update, too.
- we then call it on the parent, which will be a no-op and thus
fast for a log based filesystem, but still provide a fallback
if that is not the case.

2010-02-10 19:53:05

by Ben Myers

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] commit_metadata export operation and nfsd_sync2

Hey Christoph,

Thanks for the suggestions.

On Wed, Feb 10, 2010 at 03:56:14AM -0500, Christoph Hellwig wrote:
> A better calling convention would be for the first paramter to
> always be non-zero (and we could take the file handle for that one),

Yeah that works out much nicer.

> Currently the only caller passing a NULL first argument is
> nfsd_setattr. If we use ->write_inode as fallback we could just
> pass it as first, if using ->fsync we'd need to take i_mutex before,
> but we might just stick to using ->write_inode to keep things
> simple (and get rid of the NULL file pointer in ->fsync).
>
> > + if (export_ops->commit_metadata) {
> > + if (parent)
> > + error = filemap_write_and_wait(p_inode->i_mapping);
> > + if (child)
> > + error2 = filemap_write_and_wait(c_inode->i_mapping);
>
> I think Trond explained that we do not want force data to disk here.

Thought so. I was making an effort to punch all the same buttons as
before so that behavior wouldn't change for filesystems other than xfs
until they're ready.

> > + if (parent) {
> > + error = filemap_write_and_wait(p_inode->i_mapping);
> > + if (!error && p_inode->i_fop->fsync)
> > + error = p_inode->i_fop->fsync(NULL, parent, 0);
> > + }
> > + if (child)
> > + write_inode_now(c_inode, 1);
> > + }
>
> Btw, as we don't need to write data to disk this should be a
> sync_inode calls with the following second argument:
>
> struct writeback_control wbc = {
> .sync_mode = WB_SYNC_ALL,
> .nr_to_write = 0, /* metadata-only */
> };

So, uh, are you suggesting that I use file_operations.fsync,
write_inode_now, vfs_fsync, write_inode, sync_inode, or
super_operations.write_inode? Would it be good to stay away from the
inode_lock?

-Ben

2010-02-10 20:14:51

by Ben Myers

[permalink] [raw]

Subject: Re: [RFC PATCH 2/2] xfs_export_operations.commit_metadata

Howdy,

On Wed, Feb 10, 2010 at 04:07:50AM -0500, Christoph Hellwig wrote:
> On Tue, Feb 09, 2010 at 06:33:37PM -0600, Ben Myers wrote:
> > hopefully
> > the XFS gurus can continue to provide guidance.
>
> Ccing the xfs list would help with that :)

I'll cross-post the next rev. ;)

> > + if (xfs_ipincount(c_xip)) {
> > + /*
> > + * AFAICS the child is always modified after the parent
> > + * in nfsd so should always have a larger lsn.
> > + */
> > + if (c_xip->i_itemp->ili_last_lsn > force_lsn) {
> > + force_lsn = c_xip->i_itemp->ili_last_lsn;
> > + } else {
> > + force_lsn = 0; /* whole thing */
> > + }
>
> I wouldn't rely on that and always take the larger one.

I am concerned that ili_last_lsn will roll over at some point. I
haven't figured out how to detect that yet.

> Now with the simplification of always having a non-zero first argument
> suggested in the previous mail this might be simplified down to:
>
> STATIC int
> xfs_fs_nfs_commit_metadata(
> struct dentry *parent,
> struct dentry *child)
> {
> struct xfs_inode *dp = XFS_I(parent->d_inode);
> struct xfs_inode *ip = NULL;
> struct xfs_mount *mp = dp->i_mount;
> xfs_lsn_t force_lsn = 0;
> int error = 0;
>
> if (child) {
> ip = XFS_I(child->d_inode);
> xfs_lock_two_inodes(dp, ip, XFS_ILOCK_SHARED);
> } else {
> xfs_ilock(dp, XFS_ILOCK_SHARED);
> }
>
> if (xfs_ipincount(dp))
> force_lsn = dp->i_itemp->ili_last_lsn;
> if (ip && xfs_ipincount(ip))
> force_lsn = max(force_lsn, ip->i_itemp->ili_last_lsn);
>
+ if (force_lsn)
> error = _xfs_log_force_lsn(mp, force_lsn, NULL);
>
> if (ip)
> xfs_iunlock(ip, XFS_ILOCK_SHARED);
> xfs_iunlock(dp, XFS_ILOCK_SHARED);
>
> return error;
> }

That's nice!

> Note that _xfs_log_force_lsn is new in the XFS tree, mainline still
> has _xfs_log_force with an lsn argument.

I've been working with Bruce's tree. Not sure how best to handle that.

Thanks,
Ben

2010-02-10 21:29:14

by Dave Chinner

[permalink] [raw]

Subject: Re: [RFC PATCH 2/2] xfs_export_operations.commit_metadata

On Wed, Feb 10, 2010 at 02:15:27PM -0600, [email protected] wrote:
> Howdy,
>
> On Wed, Feb 10, 2010 at 04:07:50AM -0500, Christoph Hellwig wrote:
> > On Tue, Feb 09, 2010 at 06:33:37PM -0600, Ben Myers wrote:
> > > hopefully
> > > the XFS gurus can continue to provide guidance.
> >
> > Ccing the xfs list would help with that :)
>
> I'll cross-post the next rev. ;)
>
> > > + if (xfs_ipincount(c_xip)) {
> > > + /*
> > > + * AFAICS the child is always modified after the parent
> > > + * in nfsd so should always have a larger lsn.
> > > + */
> > > + if (c_xip->i_itemp->ili_last_lsn > force_lsn) {
> > > + force_lsn = c_xip->i_itemp->ili_last_lsn;
> > > + } else {
> > > + force_lsn = 0; /* whole thing */
> > > + }
> >
> > I wouldn't rely on that and always take the larger one.
>
> I am concerned that ili_last_lsn will roll over at some point. I
> haven't figured out how to detect that yet.

XFS_LSN_CMP() should be used for LSN comparisons - it handles
rollover corectly.

> > Now with the simplification of always having a non-zero first argument
> > suggested in the previous mail this might be simplified down to:
> >
> > STATIC int
> > xfs_fs_nfs_commit_metadata(
> > struct dentry *parent,
> > struct dentry *child)
> > {
> > struct xfs_inode *dp = XFS_I(parent->d_inode);
> > struct xfs_inode *ip = NULL;
> > struct xfs_mount *mp = dp->i_mount;
> > xfs_lsn_t force_lsn = 0;
> > int error = 0;
> >
> > if (child) {
> > ip = XFS_I(child->d_inode);
> > xfs_lock_two_inodes(dp, ip, XFS_ILOCK_SHARED);
> > } else {
> > xfs_ilock(dp, XFS_ILOCK_SHARED);
> > }
> >
> > if (xfs_ipincount(dp))
> > force_lsn = dp->i_itemp->ili_last_lsn;
> > if (ip && xfs_ipincount(ip))
> > force_lsn = max(force_lsn, ip->i_itemp->ili_last_lsn);

So this should be:

if (XFS_LSN_CMP(force_lsn, ip->i_itemp->ili_last_lsn) < 0)
force_lsn = ip->i_itemp->ili_last_lsn;

Cheers,

Dave.
--
Dave Chinner
[email protected]

_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs

2010-02-10 21:57:00

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] commit_metadata export operation and nfsd_sync2

On Wed, Feb 10, 2010 at 01:53:43PM -0600, [email protected] wrote:
> > I think Trond explained that we do not want force data to disk here.
>
> Thought so. I was making an effort to punch all the same buttons as
> before so that behavior wouldn't change for filesystems other than xfs
> until they're ready.

Well, the data writing really isn't filesystem dependent, but depends
on the nfs semantics.

> > Btw, as we don't need to write data to disk this should be a
> > sync_inode calls with the following second argument:
> >
> > struct writeback_control wbc = {
> > .sync_mode = WB_SYNC_ALL,
> > .nr_to_write = 0, /* metadata-only */
> > };
>
> So, uh, are you suggesting that I use file_operations.fsync,
> write_inode_now, vfs_fsync, write_inode, sync_inode, or
> super_operations.write_inode? Would it be good to stay away from the
> inode_lock?

->write_inode is the method that both write_inode_now and sync_inode
end up calling, in fact write_inode_now could be implemented on top
of sync_inode. The difference is that write_inode_now already
prepares a writeback_control structure that already writes out data,
and optionally waits for data I/O to complete, while sync_inode
leaves all that to the caller. If we dont want to bother with
writing out data we have to use sync_inode if we want to use
the ->write_inode path. Given that we want to make changes that
happened through the filesystem metadata methods stable and not
user data using ->write_inode makes more sense to me than ->fsync.
Especially as ->fsync normally requires a file pointer and only
allows that to be NULL for nfs, which requires special cases all
over the place that we could get rid of.

The global lock in that path is not nice, but Nick Piggin has already
started splitting it up.

2010-02-10 21:57:45

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [RFC PATCH 2/2] xfs_export_operations.commit_metadata

On Wed, Feb 10, 2010 at 02:15:27PM -0600, [email protected] wrote:
> > Note that _xfs_log_force_lsn is new in the XFS tree, mainline still
> > has _xfs_log_force with an lsn argument.
>
> I've been working with Bruce's tree. Not sure how best to handle that.

Keep working against it for now, the merge fixup will be easy enough.