LinuxLists.cc - RFC [PATCH 0/6] Client support for crossing NFS server mountpoints

2006-04-11 17:45:43

Subject: RFC [PATCH 0/6] Client support for crossing NFS server mountpoints

The following series of patches implements NFS client support for crossing
server submounts (assuming that the server is exporting them using the
'nohide' option). We wish to ensure that inode numbers remain unique
on either side of the mountpoint, so that programs like 'tar' and
'rsync' do not get confused when confronted with files that have the same
inode number, but are actually on different filesystems on the server.

This is achieved by having the client automatically create a submount
that mirrors the one on the server.
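
To illustrate the inode-number concern: userspace tools typically decide whether two paths name the same file by comparing the (st_dev, st_ino) pair from stat(2). If two server filesystems were presented under a single client mount, they would share one st_dev, and equal inode numbers would then look like one file. A minimal sketch, assuming the hypothetical helpers `same_file()` and `make_stat()` (neither is part of the patch series):

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: the usual userspace test for "same file". */
static bool same_file(const struct stat *a, const struct stat *b)
{
	return a->st_dev == b->st_dev && a->st_ino == b->st_ino;
}

/* Hypothetical helper: a stat record with only the identity fields set. */
static struct stat make_stat(dev_t dev, ino_t ino)
{
	struct stat st = {0};

	st.st_dev = dev;
	st.st_ino = ino;
	return st;
}
```

With a separate client submount, the two files get different st_dev values, so the comparison no longer reports a false match even when the inode numbers collide.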

In order to avoid confusing users, we would like this mountpoint to be
transparent to 'umount'. In other words, when the user mounts the filesystem
'/foo', an automatic submount by the NFS client for /foo/bar should not cause
'umount /foo' to fail (particularly since the kernel cannot create entries
for /foo/bar in /etc/mtab). To get around this we mark automatically
created submounts using the new flag MNT_SHRINKABLE, and then allow
the NFS client to attempt to unmount them whenever the user calls umount on
the parent.
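
The intended umount behaviour can be modelled in userspace roughly as follows. This is only a toy sketch under stated assumptions: `toy_mount`, `TOY_SHRINKABLE`, `toy_shrink_submounts()` and `toy_umount()` are hypothetical names, and the real shrink_submounts() in patch 2/6 works on kernel mount lists rather than on a fixed child array.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_SHRINKABLE   0x1	/* stands in for MNT_SHRINKABLE */
#define TOY_MAX_CHILDREN 4

struct toy_mount {
	int flags;
	int busy;	/* non-zero if something still uses the mount */
	int mounted;
	struct toy_mount *child[TOY_MAX_CHILDREN];
};

/*
 * Detach every idle, shrinkable submount below m (deepest first).
 * Returns the number of direct submounts that could not be detached.
 */
static int toy_shrink_submounts(struct toy_mount *m)
{
	int remaining = 0;

	for (int i = 0; i < TOY_MAX_CHILDREN; i++) {
		struct toy_mount *c = m->child[i];

		if (c == NULL)
			continue;
		int below = toy_shrink_submounts(c);
		if ((c->flags & TOY_SHRINKABLE) && !c->busy && below == 0) {
			c->mounted = 0;
			m->child[i] = NULL;
		} else {
			remaining++;
		}
	}
	return remaining;
}

/* umount of the parent succeeds only if no submount is left behind. */
static int toy_umount(struct toy_mount *m)
{
	if (toy_shrink_submounts(m) != 0 || m->busy)
		return -EBUSY;
	m->mounted = 0;
	return 0;
}
```

In this model an automatic /foo/bar submount does not block 'umount /foo', whereas a submount the user created explicitly (no shrinkable flag) still does.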

Note: This code also serves as the base for NFSv4 'referral' support, in
which one server may direct the client to a different server as it crosses
into a filesystem that has been migrated.

Cheers,
Trond

2006-04-17 18:52:13

by Christoph Hellwig

Subject: Re: RFC [PATCH 1/6] VFS: Add GPL_EXPORTED function vfs_kern_mount()

On Tue, Apr 11, 2006 at 02:05:30PM -0400, Trond Myklebust wrote:
> From: Trond Myklebust <[email protected]>
>
> do_kern_mount() does not allow the kernel to use private mount interfaces
> without exposing the same interfaces to userland. The problem is that the
> filesystem is referenced by name, thus meaning that it and its mount
> interface must be registered in the global filesystem list.
>
> vfs_kern_mount() passes the struct file_system_type as an explicit
> parameter in order to overcome this limitation.

Looks good. In addition, please switch kern_mount to use it instead
of converting from struct file_system_type to name and back. Also,
all other callers of do_kern_mount except for do_new_mount should
probably use it directly instead of doing the name lookup. Except
for simple_pin_fs(), which will need a parameter change, all those
would be trivial as well. So instead of adding another entry point, care
to switch the existing one to a saner prototype and the sane name?

2006-04-17 19:35:43

by Myklebust, Trond

Subject: Re: RFC [PATCH 1/6] VFS: Add GPL_EXPORTED function vfs_kern_mount()

On Mon, 2006-04-17 at 19:52 +0100, Christoph Hellwig wrote:
> On Tue, Apr 11, 2006 at 02:05:30PM -0400, Trond Myklebust wrote:
> > From: Trond Myklebust <[email protected]>
> >
> > do_kern_mount() does not allow the kernel to use private mount interfaces
> > without exposing the same interfaces to userland. The problem is that the
> > filesystem is referenced by name, thus meaning that it and its mount
> > interface must be registered in the global filesystem list.
> >
> > vfs_kern_mount() passes the struct file_system_type as an explicit
> > parameter in order to overcome this limitation.
>
> Looks good. In addition, please switch kern_mount to use it instead
> of converting from struct file_system_type to name and back. Also,
> all other callers of do_kern_mount except for do_new_mount should
> probably use it directly instead of doing the name lookup. Except
> for simple_pin_fs(), which will need a parameter change, all those
> would be trivial as well. So instead of adding another entry point, care
> to switch the existing one to a saner prototype and the sane name?

That sounds reasonable. By 'switch to the sane name' you do mean convert
all uses of 'do_kern_mount' to 'vfs_kern_mount'?

Cheers,
Trond

2006-04-17 19:39:31

by Christoph Hellwig

Subject: Re: RFC [PATCH 1/6] VFS: Add GPL_EXPORTED function vfs_kern_mount()

On Mon, Apr 17, 2006 at 03:35:43PM -0400, Trond Myklebust wrote:
> > all other callers of do_kern_mount except for do_new_mount should
> > probably use it directly instead of doing the name lookup. Except
> > for simple_pin_fs(), which will need a parameter change, all those
> > would be trivial as well. So instead of adding another entry point, care
> > to switch the existing one to a saner prototype and the sane name?
>
> That sounds reasonable. By 'switch to the sane name' you do mean convert
> all uses of 'do_kern_mount' to 'vfs_kern_mount'?

Yes, sorry for the odd wording.
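
The distinction discussed in this sub-thread can be sketched in userspace as follows. This is an illustrative model only: `toy_fs_type`, `toy_register()`, `toy_mount_by_name()` and `toy_mount_by_type()` are hypothetical stand-ins for the name-based do_kern_mount() style and the explicit-type vfs_kern_mount() style, not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_fs_type {
	const char *name;
	int (*mount)(const char *dev);	/* returns a toy "mount id" */
};

#define TOY_MAX_TYPES 8
static struct toy_fs_type *toy_registry[TOY_MAX_TYPES];

/* Add a type to the global list, making it reachable by name. */
static int toy_register(struct toy_fs_type *t)
{
	for (int i = 0; i < TOY_MAX_TYPES; i++) {
		if (toy_registry[i] == NULL) {
			toy_registry[i] = t;
			return 0;
		}
	}
	return -1;
}

/* Name-based entry point: only registered, globally visible types work. */
static int toy_mount_by_name(const char *name, const char *dev)
{
	for (int i = 0; i < TOY_MAX_TYPES; i++) {
		if (toy_registry[i] && strcmp(toy_registry[i]->name, name) == 0)
			return toy_registry[i]->mount(dev);
	}
	return -1;
}

/* Explicit-type entry point: also works for private, unregistered types. */
static int toy_mount_by_type(struct toy_fs_type *t, const char *dev)
{
	return t->mount(dev);
}

static int toy_public_mount(const char *dev)  { (void)dev; return 1; }
static int toy_private_mount(const char *dev) { (void)dev; return 2; }

static struct toy_fs_type toy_public  = { "nfs", toy_public_mount };
static struct toy_fs_type toy_private = { "nfs", toy_private_mount };
```

Here `toy_private` shares the name "nfs" with the registered type but is never registered, so only the explicit-type call can reach its mount routine. That mirrors how the clone filesystem types in patch 5/6 are passed to vfs_kern_mount() by pointer.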

2006-04-17 20:44:33

by Myklebust, Trond

Subject: Re: RFC [PATCH 1/6] VFS: Add GPL_EXPORTED function vfs_kern_mount()

_______________________________________________
NFSv4 mailing list
[email protected]
http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4

Attachments:

linux-2.6.17-019-unexport_do_kern_mount.dif (7.93 kB)
(No filename) (138.00 B)
2006-04-17 23:39:32

by Myklebust, Trond

Subject: Re: RFC [PATCH 1/6] VFS: Add GPL_EXPORTED function vfs_kern_mount()

Attachments:

linux-2.6.17-019-unexport_do_kern_mount.dif (7.91 kB)
(No filename) (138.00 B)
2006-04-11 18:05:35

by Myklebust, Trond

Subject: RFC [PATCH 3/6] VFS: Remove dependency of ->umount_begin() call on MNT_FORCE

From: Trond Myklebust <[email protected]>

Allow filesystems to decide to perform pre-umount processing whether or not
MNT_FORCE is set.

Signed-off-by: Trond Myklebust <[email protected]>
---

fs/9p/vfs_super.c | 7 ++++---
fs/cifs/cifsfs.c | 6 ++++--
fs/fuse/inode.c | 5 +++--
fs/namespace.c | 4 ++--
fs/nfs/inode.c | 14 +++++++++-----
include/linux/fs.h | 2 +-
6 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/fs/9p/vfs_super.c b/fs/9p/vfs_super.c
index 61c599b..00c1f6b 100644
--- a/fs/9p/vfs_super.c
+++ b/fs/9p/vfs_super.c
@@ -253,11 +253,12 @@ static int v9fs_show_options(struct seq_
}

static void
-v9fs_umount_begin(struct super_block *sb)
+v9fs_umount_begin(struct vfsmount *vfsmnt, int flags)
{
- struct v9fs_session_info *v9ses = sb->s_fs_info;
+ struct v9fs_session_info *v9ses = vfsmnt->mnt_sb->s_fs_info;

- v9fs_session_cancel(v9ses);
+ if (flags & MNT_FORCE)
+ v9fs_session_cancel(v9ses);
}

static struct super_operations v9fs_super_ops = {
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index d4b713e..8c60c53 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -404,12 +404,14 @@ static struct quotactl_ops cifs_quotactl
#endif

#ifdef CONFIG_CIFS_EXPERIMENTAL
-static void cifs_umount_begin(struct super_block * sblock)
+static void cifs_umount_begin(struct vfsmount * vfsmnt, int flags)
{
struct cifs_sb_info *cifs_sb;
struct cifsTconInfo * tcon;

- cifs_sb = CIFS_SB(sblock);
+ if (!(flags & MNT_FORCE))
+ return;
+ cifs_sb = CIFS_SB(vfsmnt->mnt_sb);
if(cifs_sb == NULL)
return;

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index fd34037..7b3d4e7 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -195,9 +195,10 @@ struct inode *fuse_iget(struct super_blo
return inode;
}

-static void fuse_umount_begin(struct super_block *sb)
+static void fuse_umount_begin(struct vfsmount *vfsmnt, int flags)
{
- fuse_abort_conn(get_fuse_conn_super(sb));
+ if (flags & MNT_FORCE)
+ fuse_abort_conn(get_fuse_conn_super(vfsmnt->mnt_sb));
}

static void fuse_put_super(struct super_block *sb)
diff --git a/fs/namespace.c b/fs/namespace.c
index 7bff436..b21c5c2 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -576,8 +576,8 @@ static int do_umount(struct vfsmount *mn
*/

lock_kernel();
- if ((flags & MNT_FORCE) && sb->s_op->umount_begin)
- sb->s_op->umount_begin(sb);
+ if (sb->s_op->umount_begin)
+ sb->s_op->umount_begin(mnt, flags);
unlock_kernel();

/*
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 1fd3452..cfcc585 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -64,7 +64,7 @@ static void nfs_destroy_inode(struct ino
static int nfs_write_inode(struct inode *,int);
static void nfs_delete_inode(struct inode *);
static void nfs_clear_inode(struct inode *);
-static void nfs_umount_begin(struct super_block *);
+static void nfs_umount_begin(struct vfsmount *, int);
static int nfs_statfs(struct super_block *, struct kstatfs *);
static int nfs_show_options(struct seq_file *, struct vfsmount *);
static int nfs_show_stats(struct seq_file *, struct vfsmount *);
@@ -179,15 +179,19 @@ nfs_clear_inode(struct inode *inode)
BUG_ON(atomic_read(&nfsi->data_updates) != 0);
}

-void
-nfs_umount_begin(struct super_block *sb)
+static void nfs_umount_begin(struct vfsmount *vfsmnt, int flags)
{
- struct rpc_clnt *rpc = NFS_SB(sb)->client;
+ struct nfs_server *server;
+ struct rpc_clnt *rpc;

+ if (!(flags & MNT_FORCE))
+ return;
/* -EIO all pending I/O */
+ server = NFS_SB(vfsmnt->mnt_sb);
+ rpc = server->client;
if (!IS_ERR(rpc))
rpc_killall_tasks(rpc);
- rpc = NFS_SB(sb)->client_acl;
+ rpc = server->client_acl;
if (!IS_ERR(rpc))
rpc_killall_tasks(rpc);
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 162c6e5..f83400a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1100,7 +1100,7 @@ struct super_operations {
int (*statfs) (struct super_block *, struct kstatfs *);
int (*remount_fs) (struct super_block *, int *, char *);
void (*clear_inode) (struct inode *);
- void (*umount_begin) (struct super_block *);
+ void (*umount_begin) (struct vfsmount *, int);

int (*show_options)(struct seq_file *, struct vfsmount *);
int (*show_stats)(struct seq_file *, struct vfsmount *);
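
The calling-convention change in patch 3/6 above can be summarised with a small userspace model. This is a sketch under stated assumptions: `toy_sb_ops`, `toy_do_umount()` and `TOY_MNT_FORCE` are hypothetical names that only mimic the shape of the ->umount_begin() hook.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MNT_FORCE 0x1	/* stands in for MNT_FORCE */

struct toy_sb_ops {
	void (*umount_begin)(void *mnt, int flags);
};

static int toy_aborted;		/* counts forced aborts */
static int toy_hook_calls;	/* counts hook invocations */

/* A filesystem hook in the new style: always called, decides for itself. */
static void toy_fs_umount_begin(void *mnt, int flags)
{
	(void)mnt;
	toy_hook_calls++;
	if (flags & TOY_MNT_FORCE)
		toy_aborted++;
}

/* The caller no longer tests the force flag before invoking the hook. */
static void toy_do_umount(const struct toy_sb_ops *ops, void *mnt, int flags)
{
	if (ops->umount_begin)
		ops->umount_begin(mnt, flags);
}
```

Moving the flag test into each hook keeps the old behaviour for 9p, CIFS and FUSE, while leaving room for a filesystem to do extra work on every umount, which is what patch 6/6 uses in the NFS hook.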

2006-04-11 18:05:41

by Myklebust, Trond

Subject: RFC [PATCH 6/6] NFS: Add timeout to submounts

From: Trond Myklebust <[email protected]>

Make automounted partitions expire using the mark_mounts_for_expiry()
function. The timeout is controlled via a sysctl.

Signed-off-by: Trond Myklebust <[email protected]>
---

fs/nfs/inode.c | 3 +++
fs/nfs/namespace.c | 25 ++++++++++++++++++++++++-
fs/nfs/sysctl.c | 10 ++++++++++
include/linux/nfs_fs.h | 3 +++
4 files changed, 40 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index f5a133f..e051d00 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -184,6 +184,7 @@ static void nfs_umount_begin(struct vfsm
struct nfs_server *server;
struct rpc_clnt *rpc;

+ shrink_submounts(vfsmnt, &nfs_automount_list);
if (!(flags & MNT_FORCE))
return;
/* -EIO all pending I/O */
@@ -1964,6 +1965,7 @@ static void nfs_kill_super(struct super_
nfs_free_iostats(server->io_stats);
kfree(server->hostname);
kfree(server);
+ nfs_release_automount_timer();
}

static struct file_system_type nfs_fs_type = {
@@ -2310,6 +2312,7 @@ static void nfs4_kill_super(struct super
nfs_free_iostats(server->io_stats);
kfree(server->hostname);
kfree(server);
+ nfs_release_automount_timer();
}

static struct file_system_type nfs4_fs_type = {
diff --git a/fs/nfs/namespace.c b/fs/nfs/namespace.c
index a155505..e426516 100644
--- a/fs/nfs/namespace.c
+++ b/fs/nfs/namespace.c
@@ -18,6 +18,11 @@ #include <linux/vfs.h>

#define NFSDBG_FACILITY NFSDBG_VFS

+LIST_HEAD(nfs_automount_list);
+static void nfs_expire_automounts(void *list);
+static DECLARE_WORK(nfs_automount_task, nfs_expire_automounts, &nfs_automount_list);
+int nfs_mountpoint_expiry_timeout = 500 * HZ;
+
/*
* nfs_follow_mountpoint - handle crossing a mountpoint on the server
* @dentry - dentry of mountpoint
@@ -59,7 +64,7 @@ static void * nfs_follow_mountpoint(stru
goto out_err;

mntget(mnt);
- err = do_add_mount(mnt, nd, nd->mnt->mnt_flags, NULL);
+ err = do_add_mount(mnt, nd, nd->mnt->mnt_flags|MNT_SHRINKABLE, &nfs_automount_list);
if (err < 0) {
mntput(mnt);
if (err == -EBUSY)
@@ -70,6 +75,7 @@ static void * nfs_follow_mountpoint(stru
dput(nd->dentry);
nd->mnt = mnt;
nd->dentry = dget(mnt->mnt_root);
+ schedule_delayed_work(&nfs_automount_task, nfs_mountpoint_expiry_timeout);
out:
dprintk("%s: done, returned %d\n", __FUNCTION__, err);
return ERR_PTR(err);
@@ -87,3 +93,20 @@ struct inode_operations nfs_mountpoint_i
.follow_link = nfs_follow_mountpoint,
.getattr = nfs_getattr,
};
+
+static void nfs_expire_automounts(void *data)
+{
+ struct list_head *list = (struct list_head *)data;
+
+ mark_mounts_for_expiry(list);
+ if (!list_empty(list))
+ schedule_delayed_work(&nfs_automount_task, nfs_mountpoint_expiry_timeout);
+}
+
+void nfs_release_automount_timer(void)
+{
+ if (list_empty(&nfs_automount_list)) {
+ cancel_delayed_work(&nfs_automount_task);
+ flush_scheduled_work();
+ }
+}
diff --git a/fs/nfs/sysctl.c b/fs/nfs/sysctl.c
index 4c486eb..db61e51 100644
--- a/fs/nfs/sysctl.c
+++ b/fs/nfs/sysctl.c
@@ -12,6 +12,7 @@ #include <linux/sysctl.h>
#include <linux/module.h>
#include <linux/nfs4.h>
#include <linux/nfs_idmap.h>
+#include <linux/nfs_fs.h>

#include "callback.h"

@@ -46,6 +47,15 @@ #ifdef CONFIG_NFS_V4
.strategy = &sysctl_jiffies,
},
#endif
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "nfs_mountpoint_timeout",
+ .data = &nfs_mountpoint_expiry_timeout,
+ .maxlen = sizeof(nfs_mountpoint_expiry_timeout),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_jiffies,
+ .strategy = &sysctl_jiffies,
+ },
{ .ctl_name = 0 }
};

diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 7cd75e0..fe1e962 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -400,7 +400,10 @@ #endif
/*
* linux/fs/nfs/namespace.c
*/
+extern struct list_head nfs_automount_list;
extern struct inode_operations nfs_mountpoint_inode_operations;
+extern int nfs_mountpoint_expiry_timeout;
+extern void nfs_release_automount_timer(void);

/*
* linux/fs/nfs/unlink.c
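
The expiry scheme in patch 6/6 above relies on a periodic pass that re-arms itself while automounts remain. The sketch below models the two-pass idea usually associated with mark_mounts_for_expiry(): a mount is only dropped if it was already marked by the previous pass and has not been used since. This is an assumption about the helper's semantics, not something the patch text states, and `toy_automount`, `toy_expire_pass()` and `toy_touch()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct toy_automount {
	int in_use;		/* non-zero while something holds the mount */
	int expiry_mark;	/* set by one pass, cleared by any use */
	int mounted;
};

/* Any use of the mount clears the mark, restarting its idle interval. */
static void toy_touch(struct toy_automount *m)
{
	m->expiry_mark = 0;
}

/*
 * One expiry pass over n automounts.
 * Returns how many are still mounted; the caller would re-arm its
 * timer only while this is non-zero.
 */
static int toy_expire_pass(struct toy_automount *v, int n)
{
	int left = 0;

	for (int i = 0; i < n; i++) {
		if (!v[i].mounted)
			continue;
		int was_marked = v[i].expiry_mark;
		v[i].expiry_mark = 1;
		if (was_marked && !v[i].in_use)
			v[i].mounted = 0;	/* idle for a full interval */
		else
			left++;
	}
	return left;
}
```

In the patch, the interval between passes is nfs_mountpoint_expiry_timeout (500 * HZ by default), adjustable through the nfs_mountpoint_timeout sysctl entry.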

2006-04-11 18:05:39

by Myklebust, Trond

Subject: RFC [PATCH 5/6] NFS: Ensure the client submounts, when it crosses a server mountpoint.

From: Trond Myklebust <[email protected]>

Signed-off-by: Trond Myklebust <[email protected]>
---

fs/nfs/Makefile | 3
fs/nfs/dir.c | 16 +++
fs/nfs/inode.c | 303 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/nfs/namespace.c | 89 ++++++++++++++
fs/nfs/nfs4_fs.h | 1
fs/nfs/nfs4proc.c | 2
include/linux/nfs_fs.h | 9 +
7 files changed, 418 insertions(+), 5 deletions(-)

diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index ec61fd5..d9d494c 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -5,7 +5,8 @@ #
obj-$(CONFIG_NFS_FS) += nfs.o

nfs-y := dir.o file.o inode.o nfs2xdr.o pagelist.o \
- proc.o read.o symlink.o unlink.o write.o
+ proc.o read.o symlink.o unlink.o write.o \
+ namespace.o
nfs-$(CONFIG_ROOT_NFS) += nfsroot.o mount_clnt.o
nfs-$(CONFIG_NFS_V3) += nfs3proc.o nfs3xdr.o
nfs-$(CONFIG_NFS_V3_ACL) += nfs3acl.o
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index a23f348..866672a 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -869,6 +869,17 @@ int nfs_is_exclusive_create(struct inode
return (nd->intent.open.flags & O_EXCL) != 0;
}

+static inline int nfs_reval_fsid(struct inode *dir,
+ struct nfs_fh *fh, struct nfs_fattr *fattr)
+{
+ struct nfs_server *server = NFS_SERVER(dir);
+
+ if (!nfs_fsid_equal(&server->fsid, &fattr->fsid))
+ /* Revalidate fsid on root dir */
+ return __nfs_revalidate_inode(server, dir->i_sb->s_root->d_inode);
+ return 0;
+}
+
static struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, struct nameidata *nd)
{
struct dentry *res;
@@ -897,6 +908,11 @@ static struct dentry *nfs_lookup(struct
error = NFS_PROTO(dir)->lookup(dir, &dentry->d_name, &fhandle, &fattr);
if (error == -ENOENT)
goto no_entry;
+ if (error < 0) {
+ res = ERR_PTR(error);
+ goto out_unlock;
+ }
+ error = nfs_reval_fsid(dir, &fhandle, &fattr);
if (error < 0) {
res = ERR_PTR(error);
goto out_unlock;
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index bf9d404..f5a133f 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -238,6 +238,14 @@ nfs_block_size(unsigned long bsize, unsi
return nfs_block_bits(bsize, nrbitsp);
}

+static inline void
+nfs_super_set_maxbytes(struct super_block *sb, __u64 maxfilesize)
+{
+ sb->s_maxbytes = (loff_t)maxfilesize;
+ if (sb->s_maxbytes > MAX_LFS_FILESIZE || sb->s_maxbytes <= 0)
+ sb->s_maxbytes = MAX_LFS_FILESIZE;
+}
+
/*
* Obtain the root inode of the file system.
*/
@@ -348,9 +356,7 @@ nfs_sb_init(struct super_block *sb, rpc_
}
server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;

- sb->s_maxbytes = fsinfo.maxfilesize;
- if (sb->s_maxbytes > MAX_LFS_FILESIZE)
- sb->s_maxbytes = MAX_LFS_FILESIZE;
+ nfs_super_set_maxbytes(sb, fsinfo.maxfilesize);

server->client->cl_intr = (server->flags & NFS_MOUNT_INTR) ? 1 : 0;
server->client->cl_softrtry = (server->flags & NFS_MOUNT_SOFT) ? 1 : 0;
@@ -897,6 +903,11 @@ nfs_fhget(struct super_block *sb, struct
if (nfs_server_capable(inode, NFS_CAP_READDIRPLUS)
&& fattr->size <= NFS_LIMIT_READDIRPLUS)
set_bit(NFS_INO_ADVISE_RDPLUS, &NFS_FLAGS(inode));
+ /* Deal with crossing mountpoints */
+ if (!nfs_fsid_equal(&NFS_SB(sb)->fsid, &fattr->fsid)) {
+ inode->i_op = &nfs_mountpoint_inode_operations;
+ inode->i_fop = NULL;
+ }
} else if (S_ISLNK(inode->i_mode))
inode->i_op = &nfs_symlink_inode_operations;
else
@@ -1670,6 +1681,141 @@ #endif
/*
* File system information
*/
+
+/*
+ * nfs_path - reconstruct the path given an arbitrary dentry
+ * @base - arbitrary string to prepend to the path
+ * @dentry - pointer to dentry
+ * @buffer - result buffer
+ * @buflen - length of buffer
+ *
+ * Helper function for constructing the path from the
+ * root dentry to an arbitrary hashed dentry.
+ *
+ * This is mainly for use in figuring out the path on the
+ * server side when automounting on top of an existing partition.
+ */
+static char *nfs_path(const char *base, const struct dentry *dentry,
+ char *buffer, ssize_t buflen)
+{
+ char *end = buffer+buflen;
+ int namelen;
+
+ *--end = '\0';
+ buflen--;
+ spin_lock(&dcache_lock);
+ while (!IS_ROOT(dentry)) {
+ namelen = dentry->d_name.len;
+ buflen -= namelen + 1;
+ if (buflen < 0)
+ goto Elong;
+ end -= namelen;
+ memcpy(end, dentry->d_name.name, namelen);
+ *--end = '/';
+ dentry = dentry->d_parent;
+ }
+ spin_unlock(&dcache_lock);
+ namelen = strlen(base);
+ /* Strip off excess slashes in base string */
+ while (namelen > 0 && base[namelen - 1] == '/')
+ namelen--;
+ buflen -= namelen;
+ if (buflen < 0)
+ goto Elong;
+ end -= namelen;
+ memcpy(end, base, namelen);
+ return end;
+Elong:
+ return ERR_PTR(-ENAMETOOLONG);
+}
+
+struct nfs_clone_mount {
+ const struct super_block *sb;
+ const struct dentry *dentry;
+ struct nfs_fh *fh;
+ struct nfs_fattr *fattr;
+};
+
+static struct super_block *nfs_clone_generic_sb(struct nfs_clone_mount *data,
+ struct super_block *(*clone_client)(struct nfs_server *, struct nfs_clone_mount *))
+{
+ struct nfs_server *server;
+ struct nfs_server *parent = NFS_SB(data->sb);
+ struct super_block *sb = ERR_PTR(-EINVAL);
+ void *err = ERR_PTR(-ENOMEM);
+ struct inode *root_inode;
+ struct nfs_fsinfo fsinfo;
+ int len;
+
+ server = kmalloc(sizeof(struct nfs_server), GFP_KERNEL);
+ if (server == NULL)
+ goto out_err;
+ memcpy(server, parent, sizeof(*server));
+ len = strlen(parent->hostname) + 1;
+ server->hostname = kmalloc(len, GFP_KERNEL);
+ if (server->hostname == NULL)
+ goto free_server;
+ memcpy(server->hostname, parent->hostname, len);
+ server->fsid = data->fattr->fsid;
+ nfs_copy_fh(&server->fh, data->fh);
+ if (rpciod_up() != 0)
+ goto free_hostname;
+
+ sb = clone_client(server, data);
+ if (IS_ERR((err = sb)) || sb->s_root)
+ goto kill_rpciod;
+
+ sb->s_op = data->sb->s_op;
+ sb->s_blocksize = data->sb->s_blocksize;
+ sb->s_blocksize_bits = data->sb->s_blocksize_bits;
+ sb->s_maxbytes = data->sb->s_maxbytes;
+
+ server->client_sys = server->client_acl = ERR_PTR(-EINVAL);
+ err = ERR_PTR(-ENOMEM);
+ server->io_stats = nfs_alloc_iostats();
+ if (server->io_stats == NULL)
+ goto out_deactivate;
+
+ server->client = rpc_clone_client(parent->client);
+ if (IS_ERR((err = server->client)))
+ goto out_deactivate;
+ if (!IS_ERR(parent->client_sys)) {
+ server->client_sys = rpc_clone_client(parent->client_sys);
+ if (IS_ERR((err = server->client_sys)))
+ goto out_deactivate;
+ }
+ if (!IS_ERR(parent->client_acl)) {
+ server->client_acl = rpc_clone_client(parent->client_acl);
+ if (IS_ERR((err = server->client_acl)))
+ goto out_deactivate;
+ }
+ root_inode = nfs_fhget(sb, data->fh, data->fattr);
+ if (!root_inode)
+ goto out_deactivate;
+ sb->s_root = d_alloc_root(root_inode);
+ if (!sb->s_root)
+ goto out_put_root;
+ fsinfo.fattr = data->fattr;
+ if (NFS_PROTO(root_inode)->fsinfo(server, data->fh, &fsinfo) == 0)
+ nfs_super_set_maxbytes(sb, fsinfo.maxfilesize);
+ sb->s_root->d_op = server->rpc_ops->dentry_ops;
+ sb->s_flags |= MS_ACTIVE;
+ return sb;
+out_put_root:
+ iput(root_inode);
+out_deactivate:
+ up_write(&sb->s_umount);
+ deactivate_super(sb);
+ return (struct super_block *)err;
+kill_rpciod:
+ rpciod_down();
+free_hostname:
+ kfree(server->hostname);
+free_server:
+ kfree(server);
+out_err:
+ return (struct super_block *)err;
+}

static int nfs_set_super(struct super_block *s, void *data)
{
@@ -1827,7 +1973,32 @@ static struct file_system_type nfs_fs_ty
.kill_sb = nfs_kill_super,
.fs_flags = FS_ODD_RENAME|FS_REVAL_DOT|FS_BINARY_MOUNTDATA,
};
+
+static struct super_block *nfs_clone_client(struct nfs_server *server, struct nfs_clone_mount *data)
+{
+ struct super_block *sb;

+ sb = sget(&nfs_fs_type, nfs_compare_super, nfs_set_super, server);
+ if (!IS_ERR(sb) && sb->s_root == NULL && !(server->flags & NFS_MOUNT_NONLM))
+ lockd_up();
+ return sb;
+}
+
+static struct super_block *nfs_clone_nfs_sb(struct file_system_type *fs_type,
+ int flags, const char *dev_name, void *raw_data)
+{
+ struct nfs_clone_mount *data = raw_data;
+ return nfs_clone_generic_sb(data, nfs_clone_client);
+}
+
+static struct file_system_type clone_nfs_fs_type = {
+ .owner = THIS_MODULE,
+ .name = "nfs",
+ .get_sb = nfs_clone_nfs_sb,
+ .kill_sb = nfs_kill_super,
+ .fs_flags = FS_ODD_RENAME|FS_REVAL_DOT|FS_BINARY_MOUNTDATA,
+};
+
#ifdef CONFIG_NFS_V4

static void nfs4_clear_inode(struct inode *);
@@ -2177,7 +2348,76 @@ static int param_set_idmap_timeout(const

module_param_call(idmap_cache_timeout, param_set_idmap_timeout, param_get_int,
&nfs_idmap_cache_timeout, 0644);
+
+/* Constructs the SERVER-side path */
+static inline char *nfs4_path(const struct dentry *dentry, char *buffer, ssize_t buflen)
+{
+ return nfs_path(NFS_SB(dentry->d_sb)->mnt_path, dentry, buffer, buflen);
+}
+
+static inline char *nfs4_dup_path(const struct dentry *dentry)
+{
+ char *page = (char *) __get_free_page(GFP_USER);
+ char *path;

+ path = nfs4_path(dentry, page, PAGE_SIZE);
+ if (!IS_ERR(path)) {
+ int len = PAGE_SIZE + page - path;
+ char *tmp = path;
+
+ path = kmalloc(len, GFP_KERNEL);
+ if (path)
+ memcpy(path, tmp, len);
+ else
+ path = ERR_PTR(-ENOMEM);
+ }
+ free_page((unsigned long)page);
+ return path;
+}
+
+static struct super_block *nfs4_clone_client(struct nfs_server *server, struct nfs_clone_mount *data)
+{
+ const struct dentry *dentry = data->dentry;
+ struct nfs4_client *clp = server->nfs4_state;
+ struct super_block *sb;
+
+ server->mnt_path = nfs4_dup_path(dentry);
+ if (IS_ERR(server->mnt_path)) {
+ sb = (struct super_block *)server->mnt_path;
+ goto err;
+ }
+ sb = sget(&nfs4_fs_type, nfs4_compare_super, nfs_set_super, server);
+ if (IS_ERR(sb) || sb->s_root)
+ goto free_path;
+ nfs4_server_capabilities(server, &server->fh);
+
+ down_write(&clp->cl_sem);
+ atomic_inc(&clp->cl_count);
+ list_add_tail(&server->nfs4_siblings, &clp->cl_superblocks);
+ up_write(&clp->cl_sem);
+ return sb;
+free_path:
+ kfree(server->mnt_path);
+err:
+ server->mnt_path = NULL;
+ return sb;
+}
+
+static struct super_block *nfs_clone_nfs4_sb(struct file_system_type *fs_type,
+ int flags, const char *dev_name, void *raw_data)
+{
+ struct nfs_clone_mount *data = raw_data;
+ return nfs_clone_generic_sb(data, nfs4_clone_client);
+}
+
+static struct file_system_type clone_nfs4_fs_type = {
+ .owner = THIS_MODULE,
+ .name = "nfs",
+ .get_sb = nfs_clone_nfs4_sb,
+ .kill_sb = nfs4_kill_super,
+ .fs_flags = FS_ODD_RENAME|FS_REVAL_DOT|FS_BINARY_MOUNTDATA,
+};
+
#define nfs4_init_once(nfsi) \
do { \
INIT_LIST_HEAD(&(nfsi)->open_states); \
@@ -2205,11 +2445,68 @@ static inline void unregister_nfs4fs(voi
nfs_unregister_sysctl();
}
#else
+#define nfs4_clone_client(a,b) ERR_PTR(-EINVAL)
#define nfs4_init_once(nfsi) \
do { } while (0)
#define register_nfs4fs() (0)
#define unregister_nfs4fs()
#endif
+
+static inline char *nfs_devname(const struct vfsmount *mnt_parent,
+ const struct dentry *dentry,
+ char *buffer, ssize_t buflen)
+{
+ return nfs_path(mnt_parent->mnt_devname, dentry, buffer, buflen);
+}
+
+/**
+ * nfs_do_submount - set up mountpoint when crossing a filesystem boundary
+ * @mnt_parent - mountpoint of parent directory
+ * @dentry - parent directory
+ * @fh - filehandle for new root dentry
+ * @fattr - attributes for new root inode
+ *
+ */
+struct vfsmount *nfs_do_submount(const struct vfsmount *mnt_parent,
+ const struct dentry *dentry, struct nfs_fh *fh,
+ struct nfs_fattr *fattr)
+{
+ struct nfs_clone_mount mountdata = {
+ .sb = mnt_parent->mnt_sb,
+ .dentry = dentry,
+ .fh = fh,
+ .fattr = fattr,
+ };
+ struct vfsmount *mnt = ERR_PTR(-ENOMEM);
+ char *page = (char *) __get_free_page(GFP_USER);
+ char *devname;
+
+ dprintk("%s: submounting on %s/%s\n", __FUNCTION__,
+ dentry->d_parent->d_name.name,
+ dentry->d_name.name);
+ if (page == NULL)
+ goto out;
+ devname = nfs_devname(mnt_parent, dentry, page, PAGE_SIZE);
+ mnt = (struct vfsmount *)devname;
+ if (IS_ERR(devname))
+ goto free_page;
+ switch (NFS_SB(mnt_parent->mnt_sb)->rpc_ops->version) {
+ case 2:
+ case 3:
+ mnt = vfs_kern_mount(&clone_nfs_fs_type, 0, devname, &mountdata);
+ break;
+ case 4:
+ mnt = vfs_kern_mount(&clone_nfs4_fs_type, 0, devname, &mountdata);
+ break;
+ default:
+ BUG();
+ }
+free_page:
+ free_page((unsigned long)page);
+out:
+ dprintk("%s: done\n", __FUNCTION__);
+ return mnt;
+}

extern int nfs_init_nfspagecache(void);
extern void nfs_destroy_nfspagecache(void);
diff --git a/fs/nfs/namespace.c b/fs/nfs/namespace.c
new file mode 100644
index 0000000..a155505
--- /dev/null
+++ b/fs/nfs/namespace.c
@@ -0,0 +1,89 @@
+/*
+ * linux/fs/nfs/namespace.c
+ *
+ * Copyright (C) 2005 Trond Myklebust <[email protected]>
+ *
+ * NFS namespace
+ */
+
+#include <linux/config.h>
+
+#include <linux/dcache.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/nfs_fs.h>
+#include <linux/string.h>
+#include <linux/sunrpc/clnt.h>
+#include <linux/vfs.h>
+
+#define NFSDBG_FACILITY NFSDBG_VFS
+
+/*
+ * nfs_follow_mountpoint - handle crossing a mountpoint on the server
+ * @dentry - dentry of mountpoint
+ * @nd - nameidata info
+ *
+ * When we encounter a mountpoint on the server, we want to set up
+ * a mountpoint on the client too, to prevent inode numbers from
+ * colliding, and to allow "df" to work properly.
+ * On NFSv4, we also want to allow for the fact that different
+ * filesystems may be migrated to different servers in a failover
+ * situation, and that different filesystems may want to use
+ * different security flavours.
+ */
+static void * nfs_follow_mountpoint(struct dentry *dentry, struct nameidata *nd)
+{
+ struct vfsmount *mnt;
+ struct nfs_server *server = NFS_SERVER(dentry->d_inode);
+ struct dentry *parent;
+ struct nfs_fh fh;
+ struct nfs_fattr fattr;
+ int err;
+
+ BUG_ON(IS_ROOT(dentry));
+ dprintk("%s: enter\n", __FUNCTION__);
+ dput(nd->dentry);
+ nd->dentry = dget(dentry);
+ if (d_mountpoint(nd->dentry))
+ goto out_follow;
+ /* Look it up again */
+ parent = dget_parent(nd->dentry);
+ err = server->rpc_ops->lookup(parent->d_inode, &nd->dentry->d_name, &fh, &fattr);
+ dput(parent);
+ if (err != 0)
+ goto out_err;
+
+ mnt = nfs_do_submount(nd->mnt, nd->dentry, &fh, &fattr);
+ err = PTR_ERR(mnt);
+ if (IS_ERR(mnt))
+ goto out_err;
+
+ mntget(mnt);
+ err = do_add_mount(mnt, nd, nd->mnt->mnt_flags, NULL);
+ if (err < 0) {
+ mntput(mnt);
+ if (err == -EBUSY)
+ goto out_follow;
+ goto out_err;
+ }
+ mntput(nd->mnt);
+ dput(nd->dentry);
+ nd->mnt = mnt;
+ nd->dentry = dget(mnt->mnt_root);
+out:
+ dprintk("%s: done, returned %d\n", __FUNCTION__, err);
+ return ERR_PTR(err);
+out_err:
+ path_release(nd);
+ goto out;
+out_follow:
+ while(d_mountpoint(nd->dentry) && follow_down(&nd->mnt, &nd->dentry))
+ ;
+ err = 0;
+ goto out;
+}
+
+struct inode_operations nfs_mountpoint_inode_operations = {
+ .follow_link = nfs_follow_mountpoint,
+ .getattr = nfs_getattr,
+};
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index 0f5e4e7..307832f 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -217,6 +217,7 @@ extern int nfs4_proc_renew(struct nfs4_c
extern int nfs4_do_close(struct inode *inode, struct nfs4_state *state);
extern struct dentry *nfs4_atomic_open(struct inode *, struct dentry *, struct nameidata *);
extern int nfs4_open_revalidate(struct inode *, struct dentry *, int, struct nameidata *);
+extern int nfs4_server_capabilities(struct nfs_server *server, struct nfs_fh *fhandle);

extern struct nfs4_state_recovery_ops nfs4_reboot_recovery_ops;
extern struct nfs4_state_recovery_ops nfs4_network_partition_recovery_ops;
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 86f81a7..e108142 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -1331,7 +1331,7 @@ static int _nfs4_server_capabilities(str
return status;
}

-static int nfs4_server_capabilities(struct nfs_server *server, struct nfs_fh *fhandle)
+int nfs4_server_capabilities(struct nfs_server *server, struct nfs_fh *fhandle)
{
struct nfs4_exception exception = { };
int err;
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 83e2b8a..7cd75e0 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -308,6 +308,10 @@ extern void nfs_end_data_update(struct i
extern struct nfs_open_context *get_nfs_open_context(struct nfs_open_context *ctx);
extern void put_nfs_open_context(struct nfs_open_context *ctx);
extern struct nfs_open_context *nfs_find_open_context(struct inode *inode, struct rpc_cred *cred, int mode);
+extern struct vfsmount *nfs_do_submount(const struct vfsmount *mnt_parent,
+ const struct dentry *dentry,
+ struct nfs_fh *fh,
+ struct nfs_fattr *fattr);

/* linux/net/ipv4/ipconfig.c: trims ip addr off front of name, too. */
extern u32 root_nfs_parse_addr(char *name); /*__init*/
@@ -392,6 +396,11 @@ #else
#define nfs_register_sysctl() 0
#define nfs_unregister_sysctl() do { } while(0)
#endif
+
+/*
+ * linux/fs/nfs/namespace.c
+ */
+extern struct inode_operations nfs_mountpoint_inode_operations;

/*
* linux/fs/nfs/unlink.c
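
The nfs_path() helper in patch 5/6 above builds the server-side path from the end of the buffer towards the front, walking from the dentry up to the root and finally prepending the base string. A userspace sketch of that technique follows. It assumes a hypothetical `toy_dentry` type, takes no locks, and returns NULL rather than ERR_PTR(-ENAMETOOLONG) when the buffer is too small.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_dentry {
	const char *name;
	const struct toy_dentry *parent;	/* NULL marks the root */
};

/*
 * Write "<base>/<component>/..." into the tail of buffer.
 * Returns a pointer to the start of the string, or NULL if it does not fit.
 */
static char *toy_path(const char *base, const struct toy_dentry *d,
		      char *buffer, ptrdiff_t buflen)
{
	char *end = buffer + buflen;
	ptrdiff_t len;

	if (buflen < 1)
		return NULL;
	*--end = '\0';
	buflen--;
	while (d->parent != NULL) {
		len = (ptrdiff_t)strlen(d->name);
		buflen -= len + 1;		/* component plus its '/' */
		if (buflen < 0)
			return NULL;
		end -= len;
		memcpy(end, d->name, (size_t)len);
		*--end = '/';
		d = d->parent;
	}
	len = (ptrdiff_t)strlen(base);
	/* Strip trailing slashes from the base, as the patch does. */
	while (len > 0 && base[len - 1] == '/')
		len--;
	buflen -= len;
	if (buflen < 0)
		return NULL;
	end -= len;
	memcpy(end, base, (size_t)len);
	return end;
}
```

Filling the buffer backwards means the deepest component is placed first and no second pass is needed to reverse the components, at the cost of the result starting somewhere inside the buffer rather than at its beginning.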

2006-04-11 18:05:32

by Myklebust, Trond

Subject: RFC [PATCH 2/6] VFS: Add shrink_submounts()

From: Trond Myklebust <[email protected]>

Allow a submount to be marked as being 'shrinkable' by means of the
vfsmount->mnt_flags, and then add a function 'shrink_submounts()' which
attempts to recursively unmount these submounts.

Signed-off-by: Trond Myklebust <[email protected]>
---

fs/namespace.c | 124 +++++++++++++++++++++++++++++++++++++++----------
include/linux/mount.h | 3 +
2 files changed, 102 insertions(+), 25 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 2c5f1f8..7bff436 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1166,13 +1166,46 @@ static void expire_mount(struct vfsmount
}

/*
+ * go through the vfsmounts we've just consigned to the graveyard to
+ * - check that they're still dead
+ * - delete the vfsmount from the appropriate namespace under lock
+ * - dispose of the corpse
+ */
+static void expire_mount_list(struct list_head *graveyard, struct list_head *mounts)
+{
+ struct namespace *namespace;
+ struct vfsmount *mnt;
+
+ while (!list_empty(graveyard)) {
+ LIST_HEAD(umounts);
+ mnt = list_entry(graveyard->next, struct vfsmount, mnt_expire);
+ list_del_init(&mnt->mnt_expire);
+
+ /* don't do anything if the namespace is dead - all the
+ * vfsmounts from it are going away anyway */
+ namespace = mnt->mnt_namespace;
+ if (!namespace || !namespace->root)
+ continue;
+ get_namespace(namespace);
+
+ spin_unlock(&vfsmount_lock);
+ down_write(&namespace_sem);
+ expire_mount(mnt, mounts, &umounts);
+ up_write(&namespace_sem);
+ release_mounts(&umounts);
+ mntput(mnt);
+ put_namespace(namespace);
+ spin_lock(&vfsmount_lock);
+ }
+}
+
+/*
* process a list of expirable mountpoints with the intent of discarding any
* mountpoints that aren't in use and haven't been touched since last we came
* here
*/
void mark_mounts_for_expiry(struct list_head *mounts)
{
- struct namespace *namespace;
struct vfsmount *mnt, *next;
LIST_HEAD(graveyard);

@@ -1196,38 +1229,79 @@ void mark_mounts_for_expiry(struct list_
list_move(&mnt->mnt_expire, &graveyard);
}

- /*
- * go through the vfsmounts we've just consigned to the graveyard to
- * - check that they're still dead
- * - delete the vfsmount from the appropriate namespace under lock
- * - dispose of the corpse
- */
- while (!list_empty(&graveyard)) {
- LIST_HEAD(umounts);
- mnt = list_entry(graveyard.next, struct vfsmount, mnt_expire);
- list_del_init(&mnt->mnt_expire);
+ expire_mount_list(&graveyard, mounts);

- /* don't do anything if the namespace is dead - all the
- * vfsmounts from it are going away anyway */
- namespace = mnt->mnt_namespace;
- if (!namespace || !namespace->root)
+ spin_unlock(&vfsmount_lock);
+}
+
+EXPORT_SYMBOL_GPL(mark_mounts_for_expiry);
+
+/*
+ * Ripoff of 'select_parent()'
+ *
+ * search the list of submounts for a given mountpoint, and move any
+ * shrinkable submounts to the 'graveyard' list.
+ */
+static int select_submounts(struct vfsmount *parent, struct list_head *graveyard)
+{
+ struct vfsmount *this_parent = parent;
+ struct list_head *next;
+ int found = 0;
+
+repeat:
+ next = this_parent->mnt_mounts.next;
+resume:
+ while (next != &this_parent->mnt_mounts) {
+ struct list_head *tmp = next;
+ struct vfsmount *mnt = list_entry(tmp, struct vfsmount, mnt_child);
+
+ next = tmp->next;
+ if (!(mnt->mnt_flags & MNT_SHRINKABLE))
continue;
- get_namespace(namespace);
+ /*
+ * Descend a level if the d_mounts list is non-empty.
+ */
+ if (!list_empty(&mnt->mnt_mounts)) {
+ this_parent = mnt;
+ goto repeat;
+ }

- spin_unlock(&vfsmount_lock);
- down_write(&namespace_sem);
- expire_mount(mnt, mounts, &umounts);
- up_write(&namespace_sem);
- release_mounts(&umounts);
- mntput(mnt);
- put_namespace(namespace);
- spin_lock(&vfsmount_lock);
+ if (!propagate_mount_busy(mnt, 1)) {
+ mntget(mnt);
+ list_move_tail(&mnt->mnt_expire, graveyard);
+ found++;
+ }
}
+ /*
+ * All done at this level ... ascend and resume the search
+ */
+ if (this_parent != parent) {
+ next = this_parent->mnt_child.next;
+ this_parent = this_parent->mnt_parent;
+ goto resume;
+ }
+ return found;
+}
+
+/*
+ * process a list of expirable mountpoints with the intent of discarding any
+ * submounts of a specific parent mountpoint
+ */
+void shrink_submounts(struct vfsmount *mountpoint, struct list_head *mounts)
+{
+ LIST_HEAD(graveyard);
+ int found;

+ spin_lock(&vfsmount_lock);
+
+ /* extract submounts of 'mountpoint' from the expiration list */
+ while ((found = select_submounts(mountpoint, &graveyard)) != 0)
+ expire_mount_list(&graveyard, mounts);
+
spin_unlock(&vfsmount_lock);
}

-EXPORT_SYMBOL_GPL(mark_mounts_for_expiry);
+EXPORT_SYMBOL_GPL(shrink_submounts);

/*
* Some copy_from_user() implementations do not return the exact number of
diff --git a/include/linux/mount.h b/include/linux/mount.h
index aff68c3..9b4e007 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -23,6 +23,8 @@ #define MNT_NOEXEC 0x04
#define MNT_NOATIME 0x08
#define MNT_NODIRATIME 0x10

+#define MNT_SHRINKABLE 0x100
+
#define MNT_SHARED 0x1000 /* if the vfsmount is a shared mount */
#define MNT_UNBINDABLE 0x2000 /* if the vfsmount is a unbindable mount */
#define MNT_PNODE_MASK 0x3000 /* propogation flag mask */
@@ -84,6 +86,7 @@ extern int do_add_mount(struct vfsmount
int mnt_flags, struct list_head *fslist);

extern void mark_mounts_for_expiry(struct list_head *mounts);
+extern void shrink_submounts(struct vfsmount *mountpoint, struct list_head *mounts);

extern spinlock_t vfsmount_lock;
extern dev_t name_to_dev_t(char *name);
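
For readers following the traversal in select_submounts() and the retry loop in shrink_submounts(), here is a rough userspace model. It is only a sketch: struct mount_model, MAX_CHILDREN, select_pass() and shrink_model() are made-up names, the "busy" field stands in for propagate_mount_busy(), and none of the locking or namespace handling is represented. What it does try to mirror is that a pass only collects idle shrinkable leaves, does not descend through non-shrinkable mounts, and is repeated until a pass finds nothing.

```c
#include <stddef.h>

#define MNT_SHRINKABLE 0x100   /* flag value taken from the patch */
#define MAX_CHILDREN   4       /* hypothetical bound for this sketch */

/* Hypothetical userspace stand-in for struct vfsmount. */
struct mount_model {
    int flags;
    int busy;                                   /* nonzero: still in use */
    int nchildren;
    struct mount_model *children[MAX_CHILDREN];
};

/*
 * One pass over the submounts of 'parent': detach every shrinkable,
 * idle leaf. A shrinkable mount that still has submounts is descended
 * into but not detached in this pass.
 */
static int select_pass(struct mount_model *parent)
{
    int found = 0;
    int i = 0;

    while (i < parent->nchildren) {
        struct mount_model *m = parent->children[i];

        if (!(m->flags & MNT_SHRINKABLE)) {
            i++;                    /* skipped, and not descended into */
            continue;
        }
        if (m->nchildren > 0) {
            found += select_pass(m);
            i++;                    /* 'm' itself waits for a later pass */
            continue;
        }
        if (!m->busy) {
            /* detach: move the last child into this slot */
            parent->children[i] = parent->children[--parent->nchildren];
            found++;
            continue;               /* re-examine the slot we just filled */
        }
        i++;
    }
    return found;
}

/* Repeat passes until nothing more can be detached; return the total. */
static int shrink_model(struct mount_model *root)
{
    int total = 0;
    int found;

    while ((found = select_pass(root)) != 0)
        total += found;
    return total;
}
```

In this model a chain of nested shrinkable submounts is peeled off one level per pass, which is why the loop has to be repeated rather than run once.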

2006-04-11 18:05:37

by Myklebust, Trond

Subject: RFC [PATCH 4/6] NFS: Store the file system "fsid" value in the NFS super block.

From: Trond Myklebust <[email protected]>

This should enable us to detect if we are crossing a mountpoint in the
case where the server is exporting "nohide" mounts.

Signed-off-by: Trond Myklebust <[email protected]>
---

fs/nfs/idmap.c | 1 -
fs/nfs/inode.c | 8 ++++++++
fs/nfs/nfs2xdr.c | 3 ++-
fs/nfs/nfs3xdr.c | 3 ++-
fs/nfs/nfs4xdr.c | 4 ++--
include/linux/nfs_fs.h | 5 +++--
include/linux/nfs_fs_sb.h | 1 +
include/linux/nfs_page.h | 1 -
include/linux/nfs_xdr.h | 19 ++++++++++++-------
9 files changed, 30 insertions(+), 15 deletions(-)

diff --git a/fs/nfs/idmap.c b/fs/nfs/idmap.c
index 3fab5b0..b81e7ed 100644
--- a/fs/nfs/idmap.c
+++ b/fs/nfs/idmap.c
@@ -47,7 +47,6 @@ #include <linux/sunrpc/clnt.h>
#include <linux/workqueue.h>
#include <linux/sunrpc/rpc_pipe_fs.h>

-#include <linux/nfs_fs_sb.h>
#include <linux/nfs_fs.h>
=20
#include <linux/nfs_idmap.h>
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index cfcc585..bf9d404 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -253,6 +253,7 @@ nfs_get_root(struct super_block *sb, str
return ERR_PTR(error);
}

+ server->fsid = fsinfo->fattr->fsid;
return nfs_fhget(sb, rootfh, fsinfo->fattr);
}

@@ -1514,6 +1515,7 @@ out:
*/
static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
{
+ struct nfs_server *server;
struct nfs_inode *nfsi = NFS_I(inode);
loff_t cur_isize, new_isize;
unsigned int invalid = 0;
@@ -1531,6 +1533,12 @@ static int nfs_update_inode(struct inode
*/
if ((inode->i_mode & S_IFMT) != (fattr->mode & S_IFMT))
goto out_changed;
+
+ server = NFS_SERVER(inode);
+ /* Update the fsid if and only if this is the root directory */
+ if (inode == inode->i_sb->s_root->d_inode
+ && !nfs_fsid_equal(&server->fsid, &fattr->fsid))
+ server->fsid = fattr->fsid;

/*
* Update the read time so we don't revalidate too often.
diff --git a/fs/nfs/nfs2xdr.c b/fs/nfs/nfs2xdr.c
index f0015fa..a7ed88f 100644
--- a/fs/nfs/nfs2xdr.c
+++ b/fs/nfs/nfs2xdr.c
@@ -131,7 +131,8 @@ xdr_decode_fattr(u32 *p, struct nfs_fatt
fattr->du.nfs2.blocksize = ntohl(*p++);
rdev = ntohl(*p++);
fattr->du.nfs2.blocks = ntohl(*p++);
- fattr->fsid_u.nfs3 = ntohl(*p++);
+ fattr->fsid.major = ntohl(*p++);
+ fattr->fsid.minor = 0;
fattr->fileid = ntohl(*p++);
p = xdr_decode_time(p, &fattr->atime);
p = xdr_decode_time(p, &fattr->mtime);
diff --git a/fs/nfs/nfs3xdr.c b/fs/nfs/nfs3xdr.c
index ec23361..f70eee2 100644
--- a/fs/nfs/nfs3xdr.c
+++ b/fs/nfs/nfs3xdr.c
@@ -166,7 +166,8 @@ xdr_decode_fattr(u32 *p, struct nfs_fatt
if (MAJOR(fattr->rdev) != major || MINOR(fattr->rdev) != minor)
fattr->rdev = 0;

- p = xdr_decode_hyper(p, &fattr->fsid_u.nfs3);
+ p = xdr_decode_hyper(p, &fattr->fsid.major);
+ fattr->fsid.minor = 0;
p = xdr_decode_hyper(p, &fattr->fileid);
p = xdr_decode_time3(p, &fattr->atime);
p = xdr_decode_time3(p, &fattr->mtime);
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index 7c5d70e..7270d12 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -2217,7 +2217,7 @@ static int decode_attr_symlink_support(s
return 0;
}

-static int decode_attr_fsid(struct xdr_stream *xdr, uint32_t *bitmap, struct nfs4_fsid *fsid)
+static int decode_attr_fsid(struct xdr_stream *xdr, uint32_t *bitmap, struct nfs_fsid *fsid)
{
uint32_t *p;

@@ -2863,7 +2863,7 @@ static int decode_getfattr(struct xdr_st
goto xdr_error;
if ((status = decode_attr_size(xdr, bitmap, &fattr->size)) != 0)
goto xdr_error;
- if ((status = decode_attr_fsid(xdr, bitmap, &fattr->fsid_u.nfs4)) != 0)
+ if ((status = decode_attr_fsid(xdr, bitmap, &fattr->fsid)) != 0)
goto xdr_error;
if ((status = decode_attr_fileid(xdr, bitmap, &fattr->fileid)) != 0)
goto xdr_error;
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index c71227d..83e2b8a 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -16,8 +16,6 @@ #include <linux/pagemap.h>
#include <linux/rwsem.h>
#include <linux/wait.h>

-#include <linux/nfs_fs_sb.h>
-
#include <linux/sunrpc/debug.h>
#include <linux/sunrpc/auth.h>
#include <linux/sunrpc/clnt.h>
@@ -27,6 +25,9 @@ #include <linux/nfs2.h>
#include <linux/nfs3.h>
#include <linux/nfs4.h>
#include <linux/nfs_xdr.h>
+
+#include <linux/nfs_fs_sb.h>
+
#include <linux/rwsem.h>
#include <linux/mempool.h>

diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 65dec21..6b4a13c 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -35,6 +35,7 @@ struct nfs_server {
char * hostname; /* remote hostname */
struct nfs_fh fh;
struct sockaddr_in addr;
+ struct nfs_fsid fsid;
unsigned long mount_time; /* when this fs was mounted */
#ifdef CONFIG_NFS_V4
/* Our own IP address, as a null-terminated string.
diff --git a/include/linux/nfs_page.h b/include/linux/nfs_page.h
index 66e2ed6..4cee1f8 100644
--- a/include/linux/nfs_page.h
+++ b/include/linux/nfs_page.h
@@ -13,7 +13,6 @@ #define _LINUX_NFS_PAGE_H
#include <linux/list.h>
#include <linux/pagemap.h>
#include <linux/wait.h>
-#include <linux/nfs_fs_sb.h>
#include <linux/sunrpc/auth.h>
#include <linux/nfs_xdr.h>

diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index c483e23..906c462 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -14,11 +14,19 @@ #define NFS_MAX_FILE_IO_SIZE (1048576U)
#define NFS_DEF_FILE_IO_SIZE (4096U)
#define NFS_MIN_FILE_IO_SIZE (1024U)

-struct nfs4_fsid {
- __u64 major;
- __u64 minor;
+struct nfs_fsid {
+ uint64_t major;
+ uint64_t minor;
};

+/*
+ * Helper for checking equality between 2 fsids.
+ */
+static inline int nfs_fsid_equal(const struct nfs_fsid *a, const struct nfs_fsid *b)
+{
+ return a->major == b->major && a->minor == b->minor;
+}
+
struct nfs_fattr {
unsigned short valid; /* which fields are valid */
__u64 pre_size; /* pre_op_attr.size */
@@ -40,10 +48,7 @@ struct nfs_fattr {
} nfs3;
} du;
dev_t rdev;
- union {
- __u64 nfs3; /* also nfs2 */
- struct nfs4_fsid nfs4;
- } fsid_u;
+ struct nfs_fsid fsid;
__u64 fileid;
struct timespec atime;
struct timespec mtime;
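
As an illustration of what the unified fsid buys us, the following userspace sketch shows how a stored superblock fsid could be compared against the fsid in freshly decoded attributes. Treat it as an assumption-laden model: struct fsid_model, fsid_from_v3() and crosses_server_mountpoint() are invented names, and this patch itself only stores the value; the actual crossing check is not part of it.

```c
#include <stdint.h>

/* Userspace copy of the two-field fsid introduced by the patch. */
struct fsid_model {
    uint64_t major;
    uint64_t minor;
};

/* Same comparison as nfs_fsid_equal() in the patch. */
static int fsid_model_equal(const struct fsid_model *a,
                            const struct fsid_model *b)
{
    return a->major == b->major && a->minor == b->minor;
}

/*
 * NFSv2/v3 carry a single fsid value on the wire; the patch stores it
 * in 'major' and zeroes 'minor'.
 */
static struct fsid_model fsid_from_v3(uint64_t wire_fsid)
{
    struct fsid_model f = { wire_fsid, 0 };
    return f;
}

/*
 * Hypothetical check: nonzero when the attributes belong to a different
 * server filesystem than the one recorded for the superblock.
 */
static int crosses_server_mountpoint(const struct fsid_model *sb_fsid,
                                     const struct fsid_model *attr_fsid)
{
    return !fsid_model_equal(sb_fsid, attr_fsid);
}
```

Keeping one struct for all protocol versions means the comparison does not need to know which version decoded the attributes.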

2006-04-11 18:05:30

by Myklebust, Trond

Subject: RFC [PATCH 1/6] VFS: Add GPL_EXPORTED function vfs_kern_mount()

From: Trond Myklebust <[email protected]>

do_kern_mount() does not allow the kernel to use private mount interfaces
without exposing the same interfaces to userland. The problem is that the
filesystem is referenced by name, thus meaning that it and its mount
interface must be registered in the global filesystem list.

vfs_kern_mount() passes the struct file_system_type as an explicit
parameter in order to overcome this limitation.

Signed-off-by: Trond Myklebust <[email protected]>
---

fs/super.c | 22 +++++++++++++++-------
include/linux/mount.h | 5 +++++
2 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index a66f66b..848be4f 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -800,16 +800,12 @@ struct super_block *get_sb_single(struct
EXPORT_SYMBOL(get_sb_single);

struct vfsmount *
-do_kern_mount(const char *fstype, int flags, const char *name, void *data)
+vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void *data)
{
- struct file_system_type *type = get_fs_type(fstype);
struct super_block *sb = ERR_PTR(-ENOMEM);
struct vfsmount *mnt;
int error;
char *secdata = NULL;
-
- if (!type)
- return ERR_PTR(-ENODEV);

mnt = alloc_vfsmnt(name);
if (!mnt)
@@ -841,7 +837,6 @@ do_kern_mount(const char *fstype, int fl
mnt->mnt_parent = mnt;
up_write(&sb->s_umount);
free_secdata(secdata);
- put_filesystem(type);
return mnt;
out_sb:
up_write(&sb->s_umount);
@@ -852,8 +847,21 @@ out_free_secdata:
out_mnt:
free_vfsmnt(mnt);
out:
- put_filesystem(type);
return (struct vfsmount *)sb;
+}
+
+EXPORT_SYMBOL_GPL(vfs_kern_mount);
+
+struct vfsmount *
+do_kern_mount(const char *fstype, int flags, const char *name, void *data)
+{
+ struct file_system_type *type = get_fs_type(fstype);
+ struct vfsmount *mnt;
+ if (!type)
+ return ERR_PTR(-ENODEV);
+ mnt = vfs_kern_mount(type, flags, name, data);
+ put_filesystem(type);
+ return mnt;
}

EXPORT_SYMBOL_GPL(do_kern_mount);
diff --git a/include/linux/mount.h b/include/linux/mount.h
index b7472ae..aff68c3 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -73,6 +73,11 @@ extern struct vfsmount *alloc_vfsmnt(con
extern struct vfsmount *do_kern_mount(const char *fstype, int flags,
const char *name, void *data);

+struct file_system_type;
+extern struct vfsmount *vfs_kern_mount(struct file_system_type *type,
+ int flags, const char *name,
+ void *data);
+
struct nameidata;

extern int do_add_mount(struct vfsmount *newmnt, struct nameidata *nd,
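
To make the by-name versus by-type distinction concrete, here is a small userspace model of the split. It is a sketch under obvious assumptions: the registry array, struct fs_type_model and all model_* functions are hypothetical, the "mount" just bumps a counter, and errors are plain negative errno values rather than ERR_PTR().

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct file_system_type. */
struct fs_type_model {
    const char *name;
    int refs;      /* references held by by-name lookups */
    int mounts;    /* how many times this type was mounted */
};

/* Hypothetical registry, standing in for the global filesystem list. */
static struct fs_type_model registry[] = {
    { "nfs", 0, 0 },
    { "ext3", 0, 0 },
};

/* By-name lookup: only registered types can be found. */
static struct fs_type_model *model_get_fs_type(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(registry) / sizeof(registry[0]); i++) {
        if (strcmp(registry[i].name, name) == 0) {
            registry[i].refs++;
            return &registry[i];
        }
    }
    return NULL;
}

static void model_put_filesystem(struct fs_type_model *type)
{
    type->refs--;
}

/* Takes the type explicitly, so it never consults the registry. */
static int model_vfs_kern_mount(struct fs_type_model *type)
{
    type->mounts++;
    return 0;
}

/* Thin by-name wrapper, mirroring the new do_kern_mount() in the patch. */
static int model_do_kern_mount(const char *fstype)
{
    struct fs_type_model *type = model_get_fs_type(fstype);
    int err;

    if (!type)
        return -ENODEV;
    err = model_vfs_kern_mount(type);
    model_put_filesystem(type);
    return err;
}
```

A type that is never added to the registry (say, one describing an internal submount interface) cannot be reached through model_do_kern_mount(), but kernel-side code holding a pointer to it can still call model_vfs_kern_mount() directly.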

2007-05-24 01:16:55

by Erez Zadok

Subject: possible bug/oops in nfs_pageio_add_request (2.6.22-rc2)?

I've hit a NULL ptr deref on desc->pg_error below, triggered when mounting a
stackable file system on top of nfsv3:

// from file: nfs/pagelist.c
int nfs_pageio_add_request(struct nfs_pageio_descriptor *desc,
                           struct nfs_page *req)
{
    while (!nfs_pageio_do_add_request(desc, req)) {
        nfs_pageio_doio(desc);
        if (desc->pg_error < 0)

Scenario:

2.6.22-rc2 with Unionfs 2.0 (release u2 for 2.6.22-rc2, which includes mmap
support).

I mount unionfs on top of nfs (v3). I have one file in the nfs branch. I
run a simple program through the union which mmap's the file, changes the
first byte of the file, calls msync(), and then closes. This causes
unionfs_writepage to be invoked, which in turn calls the lower file system's
->writepage, here nfs_writepage.

The 'wbc' that's passed to unionfs_writepage from the VFS has this:

wbc->for_writepages = 1
wbc->fs_private = NULL

If you follow the logic, then nfs_writepage calls nfs_writepage_locked,
passing the same wbc. nfs_writepage_locked does this:

    if (wbc->for_writepages)
        pgio = wbc->fs_private;
    else {
        nfs_pageio_init_write(&mypgio, inode, wb_priority(wbc));
        pgio = &mypgio;
    }

which means that pgio is set to NULL from the caller's wbc. Then
nfs_writepage_locked calls nfs_page_async_flush, passing it this pgio
(NULL). nfs_page_async_flush invokes nfs_pageio_add_request, passing it
this NULL pgio. Inside nfs_pageio_add_request the NULL is being
dereferenced as desc->pg_error and we get an oops.
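
The chain above can be reproduced outside the kernel with a stripped-down model of the descriptor selection. This is only a sketch: struct wbc_model, struct pageio_model and pick_descriptor() are made-up stand-ins for the real structures and for the branch in nfs_writepage_locked, and resetting pg_error stands in for nfs_pageio_init_write().

```c
#include <stddef.h>

/* Hypothetical stand-in for struct nfs_pageio_descriptor. */
struct pageio_model {
    int pg_error;
};

/* Hypothetical stand-in for the two writeback_control fields involved. */
struct wbc_model {
    int for_writepages;
    void *fs_private;
};

/*
 * Mirrors the branch quoted above: with for_writepages set, the
 * descriptor is taken from fs_private without any NULL check.
 */
static struct pageio_model *pick_descriptor(const struct wbc_model *wbc,
                                            struct pageio_model *local)
{
    if (wbc->for_writepages)
        return wbc->fs_private;
    local->pg_error = 0;    /* stands in for nfs_pageio_init_write() */
    return local;
}
```

With for_writepages = 1 and fs_private = NULL, pick_descriptor() returns NULL, and any later access such as desc->pg_error is the dereference that oopses.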

As a workaround, in unionfs_writepage I tried this before calling the lower
file system's ->writepage (which was nfs_writepage):

    struct writeback_control lower_wbc;
    memcpy(&lower_wbc, wbc, sizeof(struct writeback_control));
    if (lower_wbc.for_writepages && !lower_wbc.fs_private) {
        printk("unionfs: setting wbc.for_writepages to 0\n");
        lower_wbc.for_writepages = 0;
    }

Then I passed &lower_wbc to the lower file system's writepage method
(nfs_writepage). It works; no oops, and the file in question was sync'ed to
the backing f/s too. But I'm not sure if it's the correct workaround and
whether it'd break things for other non-NFS file systems.
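
For anyone who wants to poke at the workaround in isolation, the same idea as a standalone helper might look like the sketch below. struct wbc_model and sanitize_lower_wbc() are hypothetical names, only the two relevant fields are modelled, and whether clearing the flag is safe for other lower file systems remains the open question.

```c
#include <stddef.h>

/* Hypothetical stand-in for the two writeback_control fields involved. */
struct wbc_model {
    int for_writepages;
    void *fs_private;
};

/*
 * Return a private copy of the caller's wbc for the lower file system.
 * If the copy claims to come from ->writepages() but carries no private
 * descriptor, clear the flag so the lower ->writepage() sets up its own.
 * The caller's wbc is left untouched.
 */
static struct wbc_model sanitize_lower_wbc(const struct wbc_model *wbc)
{
    struct wbc_model lower = *wbc;    /* struct copy, like the memcpy() */

    if (lower.for_writepages && !lower.fs_private)
        lower.for_writepages = 0;
    return lower;
}
```

A wbc that already carries a non-NULL fs_private passes through unchanged, so the helper only alters the combination that triggers the oops.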

It's possible that I'm doing something wrong in unionfs's mmap code, which
indirectly results in a malformed wbc structure being passed to unionfs (by
malformed I mean that wbc->fs_private is NULL and wbc->for_writepages is set
to 1). If such a wbc can be created by any other means and passed to NFS,
then nfs probably will continue to oops even w/o unionfs.

FWIW, I tried a similar scenario with eCryptfs (another stackable f/s in
2.6.22-rc2) on top of NFSv3, and got the same oops (sorry, Mike :-)

Any pointers would be appreciated.

Thanks,
Erez.

2007-05-24 12:51:40

by Trond Myklebust

Subject: Re: possible bug/oops in nfs_pageio_add_request (2.6.22-rc2)?

On Wed, 2007-05-23 at 21:16 -0400, Erez Zadok wrote:
> I've hit a NULL ptr deref on desc->pg_error below, triggered when mounting a
> stackable file system on top of nfsv3:
>
> // from file: nfs/pagelist.c
> int nfs_pageio_add_request(struct nfs_pageio_descriptor *desc,
>                            struct nfs_page *req)
> {
>     while (!nfs_pageio_do_add_request(desc, req)) {
>         nfs_pageio_doio(desc);
>         if (desc->pg_error < 0)
>
> Scenario:
>
> 2.6.22-rc2 with Unionfs 2.0 (release u2 for 2.6.22-rc2, which includes mmap
> support).
>
> I mount unionfs on top of nfs (v3). I have one file in the nfs branch. I
> run a simple program through the union which mmap's the file, changes the
> first byte of the file, calls msync(), and then closes. This causes
> unionfs_writepage to be invoked, which in turn calls the lower file system's
> ->writepage, here nfs_writepage.
>
> The 'wbc' that's passed to unionfs_writepage from the VFS has this:
>
> wbc->for_writepages = 1
> wbc->fs_private = NULL
>
> If you follow the logic, then nfs_writepage calls nfs_writepage_locked,
> passing the same wbc. nfs_writepage_locked does this:
>
>     if (wbc->for_writepages)
>         pgio = wbc->fs_private;
>     else {
>         nfs_pageio_init_write(&mypgio, inode, wb_priority(wbc));
>         pgio = &mypgio;
>     }
>
> which means that pgio is set to NULL from the caller's wbc. Then
> nfs_writepage_locked calls nfs_page_async_flush, passing it this pgio
> (NULL). nfs_page_async_flush invokes nfs_pageio_add_request, passing it
> this NULL pgio. Inside nfs_pageio_add_request the NULL is being
> dereferenced as desc->pg_error and we get an oops.
>
> As a workaround, in unionfs_writepage I tried this before calling the lower
> file system's ->writepage (which was nfs_writepage):
>
>     struct writeback_control lower_wbc;
>     memcpy(&lower_wbc, wbc, sizeof(struct writeback_control));
>     if (lower_wbc.for_writepages && !lower_wbc.fs_private) {
>         printk("unionfs: setting wbc.for_writepages to 0\n");
>         lower_wbc.for_writepages = 0;
>     }
>
> Then I passed &lower_wbc to the lower file system's writepage method
> (nfs_writepage). It works; no oops, and the file in question was sync'ed to
> the backing f/s too. But I'm not sure if it's the correct workaround and
> whether it'd break things for other non-NFS file systems.
>
> It's possible that I'm doing something wrong in unionfs's mmap code, which
> indirectly results in a malformed wbc structure being passed to unionfs (by
> malformed I mean that wbc->fs_private is NULL and wbc->for_writepages is set
> to 1). If such a wbc can be created by any other means and passed to NFS,
> then nfs probably will continue to oops even w/o unionfs.
>
> FWIW, I tried a similar scenario with eCryptfs (another stackable f/s in
> 2.6.22-rc2) on top of NFSv3, and got the same oops (sorry, Mike :-)
>
> Any pointers would be appreciated.

If this is truly a call to ->writepages() by the VFS (as opposed to a
call to ->writepage()) then why is unionfs' writepages() failing to call
the underlying writepages method of the host filesystem: in this case
nfs_writepages()?

Trond